how to utilize Pandas aggregate functions on this DataFrame?

This is the table:
order_id product_id reordered department_id
2 33120 1 16
2 28985 1 4
2 9327 0 13
2 45918 1 13
3 17668 1 16
3 46667 1 4
3 17461 1 12
3 32665 1 3
4 46842 0 3
I want to group by department_id, summing the number of orders that come from that department, as well as the number of orders from that department where reordered == 0. The resulting table would look like this:
department_id number_of_orders number_of_reordered_0
3 2 1
4 2 0
12 1 0
13 2 1
16 2 0
I know this can be done in SQL (I forget what the query for that would look like as well, if anyone can refresh my memory on that, that'd be great too). But what are the Pandas functions to make that work?
I know that it starts with df.groupby('department_id').sum(). Not sure how to flesh out the rest of the line.

Use GroupBy.agg with 'size' to count the rows in each group, plus a lambda function that compares the values with Series.eq and counts the matches by summing the True values (each True is treated as 1):
df1 = (df.groupby('department_id')['reordered']
         .agg([('number_of_orders', 'size'),
               ('number_of_reordered_0', lambda x: x.eq(0).sum())])
         .reset_index())
print(df1)
department_id number_of_orders number_of_reordered_0
0 3 2 1
1 4 2 0
2 12 1 0
3 13 2 1
4 16 2 0
If the only possible values are 1 and 0, you can instead take the sum and subtract it from the group size afterwards:
df1 = (df.groupby('department_id')['reordered']
         .agg([('number_of_orders', 'size'), ('number_of_reordered_0', 'sum')])
         .reset_index())
# 'sum' counts the rows with reordered == 1, so subtract it from the group size
df1['number_of_reordered_0'] = df1['number_of_orders'] - df1['number_of_reordered_0']
print(df1)
department_id number_of_orders number_of_reordered_0
0 3 2 1
1 4 2 0
2 12 1 0
3 13 2 1
4 16 2 0
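For reference, the same aggregation can also be written with keyword-style named aggregation (available in pandas 0.25 and later). This is only a minimal sketch: it assumes the question's table is rebuilt by hand as df, and the output column names are simply the ones requested in the question.

```python
import pandas as pd

# Rebuild the question's table (values copied from the post).
df = pd.DataFrame({
    "order_id":      [2, 2, 2, 2, 3, 3, 3, 3, 4],
    "product_id":    [33120, 28985, 9327, 45918, 17668, 46667, 17461, 32665, 46842],
    "reordered":     [1, 1, 0, 1, 1, 1, 1, 1, 0],
    "department_id": [16, 4, 13, 13, 16, 4, 12, 3, 3],
})

df1 = (
    df.groupby("department_id")["reordered"]
      .agg(
          number_of_orders="size",                        # rows per department
          number_of_reordered_0=lambda s: s.eq(0).sum(),  # rows where reordered == 0
      )
      .reset_index()
)
print(df1)
```

Each keyword becomes an output column, which avoids the list of (name, function) tuples.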

In SQL it would be a simple aggregation:
select department_id, count(*) as number_of_orders,
sum(case when reordered = 0 then 1 else 0 end) as number_of_reordered_0
from table_name
group by department_id
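To check the query without a database server, it can be run from Python against an in-memory SQLite table. This is a sketch under assumptions: the table name orders is hypothetical (the answer only uses a placeholder name), and an ORDER BY is added so that the row order is deterministic.

```python
import sqlite3

# (order_id, product_id, reordered, department_id), copied from the question
rows = [
    (2, 33120, 1, 16), (2, 28985, 1, 4), (2, 9327, 0, 13), (2, 45918, 1, 13),
    (3, 17668, 1, 16), (3, 46667, 1, 4), (3, 17461, 1, 12), (3, 32665, 1, 3),
    (4, 46842, 0, 3),
]

con = sqlite3.connect(":memory:")
# "orders" is a hypothetical table name chosen for this example
con.execute(
    "CREATE TABLE orders (order_id INT, product_id INT, reordered INT, department_id INT)"
)
con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

query = """
SELECT department_id,
       COUNT(*) AS number_of_orders,
       SUM(CASE WHEN reordered = 0 THEN 1 ELSE 0 END) AS number_of_reordered_0
FROM orders
GROUP BY department_id
ORDER BY department_id
"""
result = con.execute(query).fetchall()
print(result)
con.close()
```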

Related

Logical AND of multiple columns in pandas

I have a dataframe(edata) as given below
Domestic Catsize Type Count
1 0 1 1
1 1 1 8
1 0 2 11
0 1 3 14
1 1 4 21
0 1 4 31
From this dataframe I want to calculate the sum of all counts where the logical AND of both variables (Domestic and Catsize) results in Zero (0) such that
1 0 0
0 1 0
0 0 0
The code I use to perform the process is
g = edata.groupby('Type')
q3 = g.apply(lambda x: x[((x['Domestic'] == 0) & (x['Catsize'] == 0)) |
                         ((x['Domestic'] == 0) & (x['Catsize'] == 1)) |
                         ((x['Domestic'] == 1) & (x['Catsize'] == 0))]['Count'].sum())
q3
Type
1 1
2 11
3 14
4 31
This code works fine. However, as the number of variables in the dataframe increases, the number of conditions grows rapidly. So, is there a smart way to write a condition that says: if ANDing the two (or more) variables results in zero, then perform the sum() function?

You can filter first using the negation of pd.DataFrame.all (df here is the question's edata):
cols = ['Domestic', 'Catsize']
res = df[~df[cols].all(axis=1)].groupby('Type')['Count'].sum()
print(res)
# Type
# 1 1
# 2 11
# 3 14
# 4 31
# Name: Count, dtype: int64
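The filter extends to any number of flag columns without writing out the combinations. The sketch below wraps it in a helper; the function name sum_where_and_is_zero and the third column Indoor are hypothetical, introduced only to illustrate the three-column case.

```python
import pandas as pd

edata = pd.DataFrame({
    "Domestic": [1, 1, 1, 0, 1, 0],
    "Catsize":  [0, 1, 0, 1, 1, 1],
    "Type":     [1, 1, 2, 3, 4, 4],
    "Count":    [1, 8, 11, 14, 21, 31],
})

def sum_where_and_is_zero(frame, flag_cols, by="Type", value="Count"):
    """Sum `value` per `by` group over the rows where the AND of flag_cols is 0."""
    mask = ~frame[flag_cols].astype(bool).all(axis=1)
    return frame[mask].groupby(by)[value].sum()

# Two flags, as in the question.
res = sum_where_and_is_zero(edata, ["Domestic", "Catsize"])

# Three flags: "Indoor" is a made-up column, not part of the question's data.
edata3 = edata.assign(Indoor=[1, 0, 1, 1, 1, 1])
res3 = sum_where_and_is_zero(edata3, ["Domestic", "Catsize", "Indoor"])
print(res)
print(res3)
```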

Use np.logical_and.reduce to generalise.
columns = ['Domestic', 'Catsize']
df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
Type
1 1
2 11
3 14
4 31
Name: Count, dtype: int64
To add the result back to the original frame as a new column, use map to broadcast it over Type:
u = df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
df['NewCol'] = df.Type.map(u)
df
Domestic Catsize Type Count NewCol
0 1 0 1 1 1
1 1 1 1 8 1
2 1 0 2 11 11
3 0 1 3 14 14
4 1 1 4 21 31
5 0 1 4 31 31

How about:
columns = ['Domestic', 'Catsize']
df.loc[~df[columns].prod(axis=1).astype(bool), 'Count']
and then do with it whatever you want.
For a logical AND over 0/1 columns, the product does the trick nicely.
For a logical OR, you can use sum(axis=1) instead, negating the result where needed.
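A small check of the product and sum tricks against DataFrame.all and DataFrame.any, assuming the flag columns only ever hold 0 or 1 (with negative values a sum could cancel to zero, so the OR shortcut would not be safe):

```python
import pandas as pd

df = pd.DataFrame({
    "Domestic": [1, 1, 1, 0, 1, 0],
    "Catsize":  [0, 1, 0, 1, 1, 1],
    "Count":    [1, 8, 11, 14, 21, 31],
})
columns = ["Domestic", "Catsize"]

and_via_prod = df[columns].prod(axis=1).astype(bool)  # True only if every flag is 1
and_via_all = df[columns].all(axis=1)
or_via_sum = df[columns].sum(axis=1).astype(bool)     # True if at least one flag is 1
or_via_any = df[columns].any(axis=1)

# Counts of the rows where the AND of the flags is 0.
selected = df.loc[~and_via_prod, "Count"].tolist()
print(selected)
```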

Determine size within each group having the same value in another column

I have dataframe like so,
ID,CLASS_ID,ACTIVE
1,123,0
2,123,0
3,456,1
4,123,0
5,456,1
11,123,1
18,123,0
7,456,0
19,123,0
8,456,1
I'm trying to count, for each CLASS_ID, the length of each run of consecutive rows that share the same ACTIVE value. In the dataframe above, CLASS_ID 123 has ACTIVE equal to 0 up to and including the 4th record, after which its next value is 1, so each of those records should get a count of 3. The count has to be reset every time the value of ACTIVE changes for the CLASS_ID. The expected output is as follows:
ID,CLASS_ID,ACTIVE,ACTIVE_COUNT
1,123,0,3
2,123,0,3
3,456,1,2
4,123,0,3
5,456,1,2
11,123,1,1
18,123,0,2
7,456,0,1
19,123,0,2
8,456,1,1
I tried using df.groupby(..).transform(..) but it's not working out for me. Could someone help me out a bit?

You can do this with groupby:
# transform keeps the original row index, so ind can be used as a grouper below
ind = df.groupby('CLASS_ID').ACTIVE.transform(
    lambda x: x.ne(x.shift()).cumsum()
)
df['ACTIVE_COUNT'] = df.groupby(['CLASS_ID', ind]).ACTIVE.transform('count')
df
ID CLASS_ID ACTIVE ACTIVE_COUNT
0 1 123 0 3
1 2 123 0 3
2 3 456 1 2
3 4 123 0 3
4 5 456 1 2
5 11 123 1 1
6 18 123 0 2
7 7 456 0 1
8 19 123 0 2
9 8 456 1 1
Details
First, create an indicator series that numbers the runs of consecutive equal ACTIVE values within each CLASS_ID:
ind = df.groupby('CLASS_ID').ACTIVE.transform(
    lambda x: x.ne(x.shift()).cumsum()
)
ind
0 1
1 1
2 1
3 1
4 1
5 2
6 3
7 2
8 3
9 3
Name: ACTIVE, dtype: int64
We then use ind as a grouper argument to df.groupby along with "CLASS_ID", and then compute the size of each group using transform.
df.groupby(['CLASS_ID', ind]).ACTIVE.transform('count')
0 3
1 3
2 2
3 3
4 2
5 1
6 2
7 1
8 2
9 1
Name: ACTIVE, dtype: int64
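A self-contained sketch of the same idea, using transform for both steps so that the intermediate run ids keep the original row index. The frame is rebuilt from the question's data; the name run_id is just a label chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    "ID":       [1, 2, 3, 4, 5, 11, 18, 7, 19, 8],
    "CLASS_ID": [123, 123, 456, 123, 456, 123, 123, 456, 123, 456],
    "ACTIVE":   [0, 0, 1, 0, 1, 1, 0, 0, 0, 1],
})

# Within each CLASS_ID, start a new run id whenever ACTIVE differs from the previous row.
run_id = df.groupby("CLASS_ID")["ACTIVE"].transform(lambda s: s.ne(s.shift()).cumsum())

# Count the rows in each (CLASS_ID, run) pair and broadcast the count to every row.
df["ACTIVE_COUNT"] = df.groupby(["CLASS_ID", run_id])["ACTIVE"].transform("size")
print(df)
```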

Pandas DataFrame joining 2 tables on <,> conditions

I would like to remove all sessions after user conversion (and also removing the sessions that happened on the day of conversion)
full_sessions = pd.DataFrame(data={'user_id':[1,1,2,3,3], 'visit_no':[1,2,1,1,2], 'date':['20180307','20180308','20180307','20180308','20180308'], 'result':[0,1,1,0,0]})
print(full_sessions)
date result user_id visit_no
0 20180307 0 1 1
1 20180308 1 1 2
2 20180307 1 2 1
3 20180308 0 3 1
4 20180308 0 3 2
When did people convert?
conversion = full_sessions[full_sessions['result'] == 1][['user_id','date']]
print(conversion)
user_id date
0 1 20180308
2 2 20180307
Ideal output:
date result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2
What do I want in SQL?
SQL would be:
SELECT * FROM (
SELECT * FROM full_sessions
LEFT JOIN conversion
ON
full_sessions.user_id = conversion.user_id AND full_sessions.date < conversion.date
UNION ALL
SELECT * FROM full_sessions
WHERE user_id NOT IN (SELECT user_id FROM conversion)
)

If I understand correctly, you can use merge in pandas:
full_sessions.merge(conversion, on='user_id', how='left').loc[lambda x: (x.date_y > x.date_x) | (x.date_y.isnull())].dropna(axis=1)
Out[397]:
date_x result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2

You can join the dataframes and then filter the rows matching your criteria this way:
df_join = full_sessions.join(conversion,lsuffix='',
rsuffix='_right',how='left',on='user_id')
print(df_join)
date result user_id visit_no user_id_right date_right
0 20180307 0 1 1 1.0 20180308
1 20180308 1 1 2 1.0 20180308
2 20180307 1 2 1 2.0 20180307
3 20180308 0 3 1 NaN NaN
4 20180308 0 3 2 NaN NaN
And then keep only the rows with NaN in date_right or with date earlier than date_right:
>>> df_join[df_join.apply(lambda x: pd.isna(x.date_right) or x.date < x.date_right,
...                       axis=1)][['date', 'visit_no', 'user_id']]
date visit_no user_id
0 20180307 1 1
3 20180308 1 3
4 20180308 2 3

Here is a method which maps a series, as an alternative to join / merge.
fs = full_sessions  # fs is assumed to be shorthand for the question's frame
fs['date'] = pd.to_numeric(fs['date'])
s = fs[fs['result'] == 1].set_index('user_id')['date']
result = fs.loc[fs['date'] < fs['user_id'].map(s).fillna(fs['date'].max()+1)]
Result
date result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2
Explanation
Create a mapping from user_id to conversion date, store it in a series s.
Then just filter on dates prior to conversion dates mapped via user_id.
If no conversion date, then data will be included since we fillna with a maximal date.
Consider using datetime objects. I have converted to numeric above for simplicity.
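Following the suggestion to use datetime objects, the same mapping idea might look like the sketch below. It assumes at most one converting session per user (map needs a unique index), and the names conv_date, mapped and kept are chosen here for illustration.

```python
import pandas as pd

full_sessions = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3],
    "visit_no": [1, 2, 1, 1, 2],
    "date":     ["20180307", "20180308", "20180307", "20180308", "20180308"],
    "result":   [0, 1, 1, 0, 0],
})

fs = full_sessions.copy()
fs["date"] = pd.to_datetime(fs["date"], format="%Y%m%d")

# Conversion date per user (assumes at most one converting session per user).
conv_date = fs.loc[fs["result"] == 1].set_index("user_id")["date"]
mapped = fs["user_id"].map(conv_date)  # NaT for users who never converted

# Keep sessions strictly before the conversion date, or all sessions if no conversion.
kept = fs[mapped.isna() | (fs["date"] < mapped)]
print(kept)
```

Testing for NaT explicitly avoids having to invent a sentinel "maximal date" for users without a conversion.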

Using groupby and apply, plus some final cleanup with reset_index, you can express it in one very long statement:
full_sessions.groupby('user_id', as_index=False).apply(
lambda x: x[:(x.result==1).values.argmax()] if any(x.result==1) else x
).reset_index(level=0, drop=True)
outputs:
date result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2

creating dataframe efficiently without for loop

I am working with some advertising data, such as email data. I have two data sets:
one at the mail level, that for each person, states what days they were mailed, and then what day they were converted.
import pandas as pd
df_emailed=pd.DataFrame()
df_emailed['person']=['A','A','A','A','B','B','B']
df_emailed['day']=[2,4,8,9,1,2,5]
df_emailed
print(df_emailed)
person day
0 A 2
1 A 4
2 A 8
3 A 9
4 B 1
5 B 2
6 B 5
I have a summary dataframe that says whether someone converted, and which day they converted.
df_summary=pd.DataFrame()
df_summary['person']=['A','B']
df_summary['days_max']=[10,5]
df_summary['convert']=[1,0]
print(df_summary)
person days_max convert
0 A 10 1
1 B 5 0
I would like to combine these into a final dataframe that says, for each person:
1 to max date,
whether they were emailed (0,1) and on the last day in the dataframe,
whether they converted or not (0,1).
We are assuming they convert on the last day in the dataframe.
I know how to do this using a nested for loop, but I think that is just incredibly inefficient and sort of dumb. Does anyone know an efficient way of getting this done?
Desired result
df_final=pd.DataFrame()
df_final['person']=['A','A','A','A','A','A','A','A','A','A','B','B','B','B','B']
df_final['day']=[1,2,3,4,5,6,7,8,9,10,1,2,3,4,5]
df_final['emailed']=[0,1,0,1,0,0,0,1,1,0,1,1,0,0,1]
df_final['convert']=[0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]
print(df_final)
person day emailed convert
0 A 1 0 0
1 A 2 1 0
2 A 3 0 0
3 A 4 1 0
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 1 0
8 A 9 1 0
9 A 10 0 1
10 B 1 1 0
11 B 2 1 0
12 B 3 0 0
13 B 4 0 0
14 B 5 1 0
Thank you and happy holidays!

A high-level approach involves modifying df_summary (aliased as df2) to get our output. We'll need to:
set_index on the days_max column of df2, and rename the index to day (which will help later on)
groupby to group on person
apply a reindex operation on the index (day, so we get a row for each day leading up to the last day)
fillna to fill the NaNs that the reindex generates in the convert column
assign to create a dummy emailed column that we'll set later.
Next, index into the result of the previous operation using df_emailed. We'll use those values to set the corresponding emailed cells to 1. This is done by MultiIndexing with loc.
Finally, use reset_index to bring the index out as columns.
import numpy as np

def f(x):
    # add a row for every day from 1 up to the person's last day
    return x.reindex(np.arange(1, x.index.max() + 1))
df = df2.set_index('days_max')\
.rename_axis('day')\
.groupby('person')['convert']\
.apply(f)\
.fillna(0)\
.astype(int)\
.to_frame()\
.assign(emailed=0)
df.loc[df1[['person', 'day']].apply(tuple, 1).values, 'emailed'] = 1
df.reset_index()
person day convert emailed
0 A 1 0 0
1 A 2 0 1
2 A 3 0 0
3 A 4 0 1
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 0 1
8 A 9 0 1
9 A 10 1 0
10 B 1 0 1
11 B 2 0 1
12 B 3 0 0
13 B 4 0 0
14 B 5 0 1
Where
df1 = df_emailed
and,
df2 = df_summary
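An alternative that avoids both explicit loops and groupby.apply is to build the full (person, day) grid first and then merge the email flags onto it. This is only a sketch, assuming each person appears exactly once in df_summary; the intermediate name grid is chosen here.

```python
import pandas as pd

df_emailed = pd.DataFrame({"person": list("AAAABBB"), "day": [2, 4, 8, 9, 1, 2, 5]})
df_summary = pd.DataFrame({"person": ["A", "B"], "days_max": [10, 5], "convert": [1, 0]})

# Repeat each summary row days_max times, then number the days 1..days_max per person.
grid = df_summary.loc[df_summary.index.repeat(df_summary["days_max"])].reset_index(drop=True)
grid["day"] = grid.groupby("person").cumcount() + 1

# Flag the emailed days with a left merge; unmatched days become 0.
grid = grid.merge(df_emailed.assign(emailed=1), on=["person", "day"], how="left")
grid["emailed"] = grid["emailed"].fillna(0).astype(int)

# The conversion is recorded only on each person's last day.
grid["convert"] = ((grid["day"] == grid["days_max"]) & (grid["convert"] == 1)).astype(int)

df_final = grid[["person", "day", "emailed", "convert"]]
print(df_final)
```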

Resample pandas dataframe only knowing result measurement count

I have a dataframe which looks like this:
Trial Measurement Data
0 0 12
1 4
2 12
1 0 12
1 12
2 0 12
1 12
2 NaN
3 12
I want to resample my data so that every trial has just two measurements
So I want to turn it into something like this:
Trial Measurement Data
0 0 8
1 8
1 0 12
1 12
2 0 12
1 12
This rather uncommon task stems from the fact that my data has an intentional jitter on the part of the stimulus presentation.
I know pandas has a resample function, but I have no idea how to apply it to my second-level index while keeping the data in discrete categories based on the first-level index :(
Also, I wanted to iterate over my first-level indices, but apparently
for sub_df in np.arange(len(df['Trial'].max()))
won't work: since 'Trial' is an index level, pandas can't find it as a column.

Well, it's not the prettiest I've ever seen, but from a frame looking like
>>> df
Trial Measurement Data
0 0 0 12
1 0 1 4
2 0 2 12
3 1 0 12
4 1 1 12
5 2 0 12
6 2 1 12
7 2 2 NaN
8 2 3 12
then we can manually build the two "average-like" objects and then use pd.melt to reshape the output:
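# Mean of the first and of the last ceil(n/2) values of each trial
avg = df.groupby("Trial")["Data"].agg(
    first_half=lambda x: x.head((len(x) + 1) // 2).mean(),
    second_half=lambda x: x.tail((len(x) + 1) // 2).mean(),
).rename(columns={"first_half": 0, "second_half": 1})
result = pd.melt(avg.reset_index(), id_vars="Trial", var_name="Measurement", value_name="Data")
result = result.sort_values(["Trial", "Measurement"]).set_index(["Trial", "Measurement"])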
which produces
>>> result
Data
Trial Measurement
0 0 8
1 8
1 0 12
1 12
2 0 12
1 12
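On the side question of iterating over the first index level: when Trial is an index level rather than a column, groupby(level=...) yields one sub-frame per trial. The sketch below assumes the question's frame has a (Trial, Measurement) MultiIndex and reuses the head/tail averaging from the answer; the helper name two_halves is hypothetical.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2), (2, 3)],
    names=["Trial", "Measurement"],
)
df = pd.DataFrame({"Data": [12, 4, 12, 12, 12, 12, 12, np.nan, 12]}, index=index)

def two_halves(s):
    """Mean of the first and of the last ceil(n/2) values, labelled 0 and 1."""
    half = (len(s) + 1) // 2
    return pd.Series({0: s.head(half).mean(), 1: s.tail(half).mean()})

# Iterate over the first index level: one sub-frame per trial.
pieces = {}
for trial, sub_df in df.groupby(level="Trial"):
    pieces[trial] = two_halves(sub_df["Data"])

# Stitch the per-trial results back into a (Trial, Measurement) indexed Series.
result = pd.concat(pieces, names=["Trial", "Measurement"])
print(result)
```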
