I would like to count all product_id values depending on the following conditions:
shared_product == 1
exclusive_product_storeA == 1
exclusive_product_storeB == 1
Main df
date product_id shared_product exclusive_product_storeA exclusive_product_storeB
2019-01-01 34434 1 0 0
2019-01-01 43546 1 0 0
2019-01-01 53288 1 0 0
2019-01-01 23444 0 1 0
2019-01-01 25344 0 1 0
2019-01-01 42344 0 0 1
Output DF
date count_shared_product count_exclusive_product_storeA count_exclusive_product_storeB
2019-01-01 3 2 1
This is what I have tried, but it does not give me the desired output df:
df.pivot_table(index=['shared_product','exclusive_product_storeA','exclusive_product_storeB'],aggfunc=['count'],values='product_id')
The idea here is to exclude rows that have a value of 0, group by date and the resulting column, and finally unstack to get your final result:
(
    df.drop("product_id", axis=1)
      .set_index("date")
      .stack()
      .loc[lambda x: x == 1]
      .groupby(level=[0, 1])
      .sum()
      .unstack()
      .rename_axis(index=None)
)
exclusive_product_storeA exclusive_product_storeB shared_product
2019-01-01 2 1 3
A shorter path would be to exclude product_id, group by date and sum the columns:
df.drop("product_id", axis=1).groupby("date").sum().rename_axis(None)
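If the exact column names from the desired output matter, the summed columns can be prefixed; a minimal sketch, assuming the same df (the out name is just illustrative):

out = (df.drop("product_id", axis=1)
         .groupby("date")
         .sum()                  # one count per indicator column
         .add_prefix("count_")   # shared_product -> count_shared_product, etc.
         .rename_axis(None))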
Given that I have a pandas dataframe:
waterflow_id created_at
0 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-20 13:19:21.430816+00:00
1 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-21 13:19:21.430819+00:00
2 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-21 13:19:21.430819+00:00
3 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-22 13:19:21.430821+00:00
4 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-22 13:19:21.430821+00:00
How do I get the median number of days between created_at values per waterflow_id, so that I end up with a dataframe like:
waterflow days_median
1 0
2 4
3 6
4 7
5 10
Basically, waterflow here represents the unique occurrences of waterflow_id.
With the latest answer, I tried:
meddata = waterflow_df.groupby("waterflow_id")['created_at'].apply(lambda s: s.diff().median())
print(meddata)
And I received:
waterflow_id
0788a658-06d9-4b61-9ac4-2728ace02a86 0 days
1f8752f8-f667-44ec-84b9-acad02d384c0 0 days
2655b525-8b2c-4a53-abdc-5208cb95d96e 0 days
8d3cd7e3-900c-4996-b202-f66eb41ac37b 0 days
9d02b939-f295-4d36-8f72-e9984a52dbd9 0 days
d8d8fb70-d755-48c3-8c19-8032864719da 0 days
dc1da5e1-6974-4145-a0d8-615e08506ebf 0 days
f39366f5-c9e2-415a-baec-530bb8bd2f07 0 days
What's weird is that I have dates spanning up to 6 months.
The output is unclear, but IIUC, you could use a GroupBy.agg:
from itertools import count
c = count(1)
df['created_at'] = pd.to_datetime(df['created_at'])
out = (df
       .groupby('waterflow_id')
       .agg(**{'waterflow': ('waterflow_id', lambda s: next(c)),
               'days_median': ('created_at', lambda s: s.diff().median()
                                                        .total_seconds()//(3600*24))
               })
      )
or using factorize to number the groups:
df['created_at'] = pd.to_datetime(df['created_at'])
(df.assign(waterflow_id=df['waterflow_id'].factorize()[0]+1)
   .groupby('waterflow_id')
   .agg(**{'waterflow': ('waterflow_id', 'first'),
           'days_median': ('created_at', lambda s: s.diff().median()
                                                    .total_seconds()//(3600*24))
           })
 )
output:
waterflow days_median
waterflow_id
5ff86588-594e-458f-9d29-385ee2e128e4 1 0.0
Simple version with just the median:
df['created_at'] = pd.to_datetime(df['created_at'])
out = (df.groupby('waterflow_id')['created_at']
.apply(lambda s: s.diff().median()
.total_seconds()//(3600*24))
)
output:
waterflow_id
5ff86588-594e-458f-9d29-385ee2e128e4 0.0
Name: created_at, dtype: float64
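A minor variant, as a sketch with the same frame: since the median of the diffs is a Timedelta, its days attribute gives the same whole-day value as the total_seconds() floor division (for non-negative gaps):

df['created_at'] = pd.to_datetime(df['created_at'])
out = (df.groupby('waterflow_id')['created_at']
         .apply(lambda s: s.diff().median().days))  # whole days of the median gap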
Below is the sample dataframe
import pandas as pd
import numpy as np
from datetime import datetime
start = datetime(2011, 1, 1)
end = datetime(2012, 1, 1)
index = pd.date_range(start, end)
df = pd.DataFrame({"Trade Days": 0}, index=index)
df.iloc[0,:]=2
df.iloc[5,:]=3
As you can see, the 'Trade Days' column has 2 on '2011-01-01' and 3 on '2011-01-06'. I want to create another column filled with 1s based on the count value in the 'Trade Days' column. A sample output column is as follows:
df['open position']=0
df.iloc[0:2,1]=1
df.iloc[5:8,1]=1
I can only think of for-loop-based filling. Is there an efficient way to do this?
Thanks in advance.
First create groups by testing for non-zero values with Series.ne and taking the cumulative sum with Series.cumsum. Then compare the first value of each group, obtained with GroupBy.transform, against a per-group counter from GroupBy.cumcount using Series.gt. Finally, convert the boolean mask to integers, mapping True/False to 1/0:
g = df['Trade Days'].ne(0).cumsum()
grouped = df.groupby(g)['Trade Days']
df['new'] = grouped.transform('first').gt(grouped.cumcount()).astype(int)
print(df.head(10))
Trade Days open position new
2011-01-01 2 1 1
2011-01-02 0 1 1
2011-01-03 0 0 0
2011-01-04 0 0 0
2011-01-05 0 0 0
2011-01-06 3 1 1
2011-01-07 0 1 1
2011-01-08 0 1 1
2011-01-09 0 0 0
2011-01-10 0 0 0
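As a quick sanity check, a sketch assuming the sample frame from the question (with its 'open position' column) and g / 'new' computed as above:

print(g.head(7).tolist())                        # group labels for the first week: [1, 1, 1, 1, 1, 2, 2]
print((df['new'] == df['open position']).all())  # expected: True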
I have a dataframe in Pandas with some columns, something like this:
data = {
'CODIGO_SINIESTRO': [10476434, 10476434, 4482524, 4482524, 4486110],
'CONDICION': ['PASAJERO', 'CONDUCTOR', 'MOTOCICLISTA', 'CICLISTA', 'PEATON'],
'EDAD': [62.0, 29.0, 26.0, 47.0, 33.0],
'SEXO': ['MASCULINO', 'FEMENINO', 'FEMENINO', 'MASCULINO', 'FEMENINO']
}
df = pd.DataFrame(data)
Output:
CODIGO_SINIESTRO CONDICION EDAD SEXO
0 10476434 PASAJERO 62.0 MASCULINO
1 10476434 CONDUCTOR 29.0 FEMENINO
2 4482524 MOTOCICLISTA 26.0 FEMENINO
3 4482524 CICLISTA 47.0 MASCULINO
4 4486110 PEATON 33.0 FEMENINO
So, I want to create another dataframe grouped by the 'CODIGO_SINIESTRO' column, with the following columns as the result:
'CODIGO_SINIESTRO': the ID of the group.
'PROMEDIO_EDAD': this column will store the mean of 'EDAD'.
'CANTIDAD_HOMBRES': this column will store the count of masculine values based on the 'SEXO' column.
'CANTIDAD_MUJERES': this column will store the count of feminine values based on the 'SEXO' column.
Finally, I want five extra columns named after the five possible values of the 'CONDICION' column; each stores 1 if the value exists for the group and 0 if not.
I wrote this solution and it works as expected; however, my dataset has many rows (150k+) and the solution is slow (about 5 minutes). This is my code:
df_final = df.groupby(['CODIGO_SINIESTRO']).agg(
CANTIDAD_HOMBRES=pd.NamedAgg(column='SEXO', aggfunc=lambda x: (x=='MASCULINO').sum()),
CANTIDAD_MUJERES=pd.NamedAgg(column='SEXO', aggfunc=lambda x: (x=='FEMENINO').sum()),
PROMEDIO_EDAD=pd.NamedAgg(column='EDAD', aggfunc=np.mean),
MOTOCICLISTA=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='MOTOCICLISTA').any().astype(int)),
CONDUCTOR=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='CONDUCTOR').any().astype(int)),
PEATON=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='PEATON').any().astype(int)),
CICLISTA=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='CICLISTA').any().astype(int)),
PASAJERO=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='PASAJERO').any().astype(int))
).reset_index()
Output:
CODIGO_SINIESTRO CANTIDAD_HOMBRES CANTIDAD_MUJERES PROMEDIO_EDAD ...
0 4482524 1 1 36.5
1 4486110 0 1 33.0
2 10476434 1 1 45.5
... MOTOCICLISTA CONDUCTOR PEATON CICLISTA PASAJERO
1 0 0 1 0
0 0 1 0 0
0 1 0 0 1
How can I optimize this solution? Are there other ways to solve it?
Thank you.
Pre-aggregating with vectorized methods should be much more efficient (it turns out it was 100x faster):
df['PROMEDIO_EDAD'] = df.groupby('CODIGO_SINIESTRO')['EDAD'].transform('mean')
df['CANTIDAD_HOMBRES'] = np.where(df['SEXO'] == 'MASCULINO', 1, 0)
df['CANTIDAD_MUJERES'] = np.where(df['SEXO'] == 'FEMENINO', 1, 0)
for col in df['CONDICION'].unique():
    df[col] = np.where(df['CONDICION'] == col, 1, 0)
df = (df.groupby(['CODIGO_SINIESTRO', 'PROMEDIO_EDAD'])
        .sum(numeric_only=True)  # keep the string columns out of the sum
        .reset_index()
        .drop('EDAD', axis=1))
df.iloc[:, 4:] = (df.iloc[:, 4:] > 0).astype(int)  # clip only the CONDICION indicators to 0/1
df
Out[1]:
CODIGO_SINIESTRO PROMEDIO_EDAD CANTIDAD_HOMBRES CANTIDAD_MUJERES \
0 4482524 36.5 1 1
1 4486110 33.0 0 1
2 10476434 45.5 1 1
PASAJERO CONDUCTOR MOTOCICLISTA CICLISTA PEATON
0 0 0 1 1 0
1 0 0 0 0 1
2 1 1 0 0 0
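Not part of the original answer, but the same pre-aggregation idea can also be sketched with get_dummies and a single groupby, assuming the untouched df from the question (tmp, g and out are just illustrative names):

dummies = pd.get_dummies(df['CONDICION']).astype(int)   # one 0/1 column per CONDICION value
sexo = (pd.get_dummies(df['SEXO'])
          .rename(columns={'MASCULINO': 'CANTIDAD_HOMBRES',
                           'FEMENINO': 'CANTIDAD_MUJERES'})
          .astype(int))
tmp = pd.concat([df[['CODIGO_SINIESTRO', 'EDAD']], sexo, dummies], axis=1)
g = tmp.groupby('CODIGO_SINIESTRO')
out = pd.concat([g['EDAD'].mean().rename('PROMEDIO_EDAD'),           # mean age per group
                 g[['CANTIDAD_HOMBRES', 'CANTIDAD_MUJERES']].sum(),  # sex counts
                 g[list(dummies.columns)].max()],                    # 1 if the CONDICION occurred in the group
                axis=1).reset_index()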
Motivation: I want to check whether customers have bought anything again two months (or more) after their first purchase (retention).
Resources: I have 2 tables:
Buy date, ID and purchase code
ID and the date of their first purchase
Sample data:
Table1
Date ID Purchase_code
2019-01-01 1 AQT1
2019-01-02 1 TRR1
2019-03-01 1 QTD1
2019-02-01 2 IGJ5
2019-02-05 2 ILW2
2019-02-20 2 WET2
2019-02-28 2 POY6
Table 2
ID First_Buy_Date
1 2019-01-01
2 2019-02-01
The expected result:
ID First_Buy_Date Retention Frequency_buy_at_first_month
1 2019-01-01 1 2
2 2019-02-01 0 4
First convert the columns to datetimes if necessary, then add the first-purchase dates with DataFrame.merge and create the new columns by comparing with Series.le or Series.gt and converting to integers:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['First_Buy_Date'] = pd.to_datetime(df2['First_Buy_Date'])
df = df1.merge(df2, on='ID', how='left')
df['Retention'] = (df['First_Buy_Date'].add(pd.DateOffset(months=2))
.le(df['Date'])
.astype(int))
df['Frequency_buy_at_first_month'] = (df['First_Buy_Date'].add(pd.DateOffset(months=1))
.gt(df['Date'])
.astype(int))
Last, aggregate with GroupBy.agg, using max for Retention (if only a 0 or 1 output is needed) and sum to count the values:
df1 = (df.groupby(['ID','First_Buy_Date'], as_index=False)
.agg({'Retention':'max', 'Frequency_buy_at_first_month':'sum'}))
print (df1)
ID First_Buy_Date Retention Frequency_buy_at_first_month
0 1 2019-01-01 1 2
1 2 2019-02-01 0 4
I have a pandas dataframe, something like this:
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is to add a column which shows the number of entries for each ID that occurred within the last year and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations but there must be a better way!
I think you need between with boolean indexing to filter first, and then groupby with size to aggregate.
The outputs are concatenated, and reindex adds the missing rows, filled with 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
# convert to datetime (if the first number is the day, add the dayfirst parameter)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
now = pd.Timestamp.today()
print (now)
oneyarbeforenow = now - pd.offsets.DateOffset(years=1)
oneyarafternow = now + pd.offsets.DateOffset(years=1)
#first filter
a = df[df['Date'].between(oneyarbeforenow, now)].groupby('ID').size()
b = df[df['Date'].between(now, oneyarafternow)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
EDIT:
If instead each date needs to be compared against a per-group reference date shifted by a one-year offset (here the group's last date, x.iat[-1]), you need a custom function that evaluates the condition and sums the True values:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(), \
(x < x.iat[-1] + offs).sum()], index=['last','next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
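For readability, the lambda above can be spelled out as a named function; this is just a sketch with the same logic, not part of the original answer:

offs = pd.offsets.DateOffset(years=1)

def last_next_counts(x):
    # x is the per-group Series of dates; the reference is the group's last date
    ref = x.iat[-1]
    return pd.Series({'last': (x > ref - offs).sum(),   # dates later than one year before the reference
                      'next': (x < ref + offs).sum()})  # dates earlier than one year after the reference

# usage: df.groupby('ID')['Date'].apply(last_next_counts).unstack(fill_value=0).reset_index()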
In [19]: x['date'] = pd.to_datetime( x['date']) # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
Using resample takes care of the missing in-between years; see year 2015.
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only the year in the columns:
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1