For records whose create_date does not fall within the period between from_date and to_date, I want to keep only the record with the largest index per 'indicator' group, together with the records whose create_date does fall within the period between from_date and to_date.
from_date = '2022-01-01'
to_date = '2022-04-10'
indicator create_date
0 A 2022-01-03
1 B 2021-12-30
2 B 2021-07-11
3 C 2021-02-10
4 C 2021-09-08
5 C 2021-07-24
6 C 2021-01-30
Here is the result I want:
indicator create_date
0 A 2022-01-03
2 B 2021-07-11
6 C 2021-01-30
I've been looking for a solution for a long time, but I only found answers like "How to get the index of the smallest value", and I can't find a way to compare index numbers.
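For reference, here is a minimal sketch that rebuilds the sample data above (the name df is an assumption, taken from the answers below):
import pandas as pd

# hypothetical reconstruction of the question's sample data
df = pd.DataFrame({
    'indicator': ['A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'create_date': ['2022-01-03', '2021-12-30', '2021-07-11', '2021-02-10',
                    '2021-09-08', '2021-07-24', '2021-01-30'],
})
from_date = '2022-01-01'
to_date = '2022-04-10'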
You can create a helper column and get the maximal index value per indicator with DataFrameGroupBy.idxmax, then select the rows with DataFrame.loc:
df2 = df.loc[df.assign(tmp=df.index).groupby('indicator')['tmp'].idxmax()]
print (df2)
indicator create_date
0 A 2022-01-03
2 B 2021-07-11
6 C 2021-01-30
EDIT: If you need to select the maximal index only for values that do not fall between from_date and to_date, use boolean indexing and then join with concat:
from_date = '2022-01-01'
to_date = '2022-04-10'
df['create_date'] = pd.to_datetime(df['create_date'])
m = df['create_date'].between(from_date, to_date)
df2 = df.loc[df.assign(tmp=df.index)[~m].groupby('indicator')['tmp'].idxmax()]
print (df2)
indicator create_date
2 B 2021-07-11
6 C 2021-01-30
df = pd.concat([df[m], df2])
print (df)
indicator create_date
0 A 2022-01-03
2 B 2021-07-11
6 C 2021-01-30
You can try
df['create_date'] = pd.to_datetime(df['create_date'])
m = df['create_date'].between(from_date, to_date)
df_ = df[~m].groupby('indicator', as_index=False).apply(lambda g: g.loc[[max(g.index)]]).droplevel(level=0)
out = pd.concat([df[m], df_], axis=0).sort_index()
print(out)
indicator create_date
0 A 2022-01-03
2 B 2021-07-11
6 C 2021-01-30
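If the index is already sorted in ascending order (as in the sample), the same rows can also be picked with GroupBy.tail; a small sketch, reusing the mask m from above:
# assumes df['create_date'] is already datetime and m is the boolean mask defined above
out = pd.concat([df[m], df[~m].groupby('indicator').tail(1)]).sort_index()
print(out)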
I have a dataframe that looks something like this
dt user
0 2016-01-01 a
1 2016-01-02 a
2 2016-01-03 a
3 2016-01-04 a
4 2016-01-05 a
5 2016-01-06 a
6 2016-01-01 b
7 2016-01-02 b
8 2016-01-03 b
9 2016-01-04 b
10 2016-01-05 b
11 2016-01-06 b
12 2016-01-07 b
13 2015-12-31 c
14 2016-01-01 c
15 2016-01-02 c
16 2016-01-03 c
17 2016-01-04 c
18 2016-01-05 c
19 2016-01-06 c
20 2016-01-07 c
21 2016-01-08 c
22 2016-01-09 c
23 2016-01-10 c
I want to find the number of missing dates for each user. For the date range, the minimum date is 2015-12-31 and the maximum date is 2016-01-10. The result would look like this:
user missing_days
a 5
b 4
c 0
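For reference, a minimal sketch that rebuilds the sample data above (the name df is an assumption, taken from the answers below):
import pandas as pd

# hypothetical reconstruction of the question's sample data, with dt as strings
df = pd.DataFrame({
    'dt': (list(pd.date_range('2016-01-01', '2016-01-06').strftime('%Y-%m-%d'))
           + list(pd.date_range('2016-01-01', '2016-01-07').strftime('%Y-%m-%d'))
           + list(pd.date_range('2015-12-31', '2016-01-10').strftime('%Y-%m-%d'))),
    'user': ['a'] * 6 + ['b'] * 7 + ['c'] * 11,
})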
Use isin to check the full date range against each user's dates, and sum the inverted boolean mask inside agg for each group:
df['dt'] = pd.to_datetime(df['dt'])  # if the `dt` column is already datetime dtype, skip this line
check_dates = pd.date_range('2015-12-31', '2016-01-10', freq='D')
s = df.groupby('user')['dt'].agg(lambda x: (~check_dates.isin(x)).sum())
print(s)
user
a 5
b 4
c 0
Name: dt, dtype: int64
### Convert your dates to datetime
df['dt'] = pd.to_datetime(df['dt'], infer_datetime_format=True)
### Create the list of dates per user
user_days = df.groupby('user')['dt'].apply(list)
### Initialize the final dataframe
df_miss_dates = pd.DataFrame(user_days)
all_dates = pd.date_range('2015-12-31', '2016-01-10', freq='D')
### Find the number of missing dates per user
df_miss_dates['missing_days'] = df_miss_dates['dt'].apply(lambda x: len(set(all_dates) - set(x)))
df_miss_dates.drop(columns='dt', inplace=True)
print(df_miss_dates)
Output:
missing_days
user
a 5
b 4
c 0
You can do it this way
from datetime import date, timedelta
sdate = date(2015, 12, 31) # start date
edate = date(2016, 1, 10) # end date
delta = edate - sdate # as timedelta
days = []
for i in range(delta.days + 1):
    day = sdate + timedelta(days=i)
    days.append(str(day))
user=[]
missing_days = []
for user_n in df.user.unique():
    user_days = df.loc[df.user == user_n, 'dt'].to_list()
    md = len([day for day in days if day not in user_days])
    user.append(user_n)
    missing_days.append(md)
new_df = pd.DataFrame({'user': user,'missing_days': missing_days})
new_df
output
user missing_days
a 5
b 4
c 0
Define the following function:
def missingDates(grp: pd.Series, d1: pd.Timestamp, d2: pd.Timestamp):
    ndTotal = (d2 - d1).days + 1
    ndPresent = grp[grp.between(d1, d2)].index.size
    return ndTotal - ndPresent
Then apply it to each group and convert the result into a DataFrame (as I see from your post, you want just a DataFrame with 2 columns):
# assumes df['dt'] has already been converted with pd.to_datetime
result = df.groupby('user')['dt'].apply(missingDates,
    pd.to_datetime('2015-12-31'), pd.to_datetime('2016-01-10'))\
    .rename('missing_days').reset_index()
The result is:
user missing_days
0 a 5
1 b 4
2 c 0
My solution relies on the fact that dates within each group are unique
and that all dates have no time part. If these conditions were not met,
the dates would have to be normalized and deduplicated with unique.
Additional remark: Change dt (the column name) to some other name,
because dt is the name of the datetime accessor in pandas.
It is bad practice to shadow standard pandas names with column or
variable names.
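For example, a one-line rename (the new name 'date' is only an illustration):
df = df.rename(columns={'dt': 'date'})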
Motivation: I want to check if customers have bought anything during the 2 months since their first purchase (retention).
Resources: I have 2 tables:
Buy date, ID and purchase code
ID and the date of their first purchase
Sample data:
Table1
Date ID Purchase_code
2019-01-01 1 AQT1
2019-01-02 1 TRR1
2019-03-01 1 QTD1
2019-02-01 2 IGJ5
2019-02-05 2 ILW2
2019-02-20 2 WET2
2019-02-28 2 POY6
Table 2
ID First_Buy_Date
1 2019-01-01
2 2019-02-01
The expected result:
ID First_Buy_Date Retention Frequency_buy_at_first_month
1 2019-01-01 1 2
2 2019-02-01 0 4
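For reference, a minimal sketch that rebuilds the two sample tables (the names df1 and df2 are assumptions, taken from the answer below):
import pandas as pd

# hypothetical reconstruction of Table1 and Table 2
df1 = pd.DataFrame({
    'Date': ['2019-01-01', '2019-01-02', '2019-03-01',
             '2019-02-01', '2019-02-05', '2019-02-20', '2019-02-28'],
    'ID': [1, 1, 1, 2, 2, 2, 2],
    'Purchase_code': ['AQT1', 'TRR1', 'QTD1', 'IGJ5', 'ILW2', 'WET2', 'POY6'],
})
df2 = pd.DataFrame({'ID': [1, 2], 'First_Buy_Date': ['2019-01-01', '2019-02-01']})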
First convert the columns to datetimes if necessary, then add the first-buy dates with DataFrame.merge and create the new columns by comparing with Series.le or Series.gt and converting to integers:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['First_Buy_Date'] = pd.to_datetime(df2['First_Buy_Date'])
df = df1.merge(df2, on='ID', how='left')
df['Retention'] = (df['First_Buy_Date'].add(pd.DateOffset(months=2))
.le(df['Date'])
.astype(int))
df['Frequency_buy_at_first_month'] = (df['First_Buy_Date'].add(pd.DateOffset(months=1))
.gt(df['Date'])
.astype(int))
Last, aggregate with GroupBy.agg, using max for Retention (if only a 0 or 1 output is needed) and sum for counting:
df1 = (df.groupby(['ID','First_Buy_Date'], as_index=False)
.agg({'Retention':'max', 'Frequency_buy_at_first_month':'sum'}))
print (df1)
ID First_Buy_Date Retention Frequency_buy_at_first_month
0 1 2019-01-01 1 2
1 2 2019-02-01 0 4
I have two dataframes, 'df_orders' and 'df_birthday':
I need to return the email address from 'df_birthday' by matching the consumer_id column in 'df_orders' to the id column in 'df_birthday', ONLY if certain conditions are met.
Condition 1:
consumer_id field in 'df_orders' appears ONLY once.
Condition 2:
The payment_complete field in 'df_orders' equals '1.0'.
Condition 3:
If TIME NOW is no more than 24 hours ahead of the updated_at (datetime) field in 'df_orders'.
Condition 4:
If conditions 1-3 are true, return the columns 'first_name', 'last_name' and 'email_addr' from df_birthday by matching 'consumer_id' from 'df_orders' to 'id' in 'df_birthday'.
To sum up all conditions:
I need to return email_addr, first_name, and last_name from df_birthday only if the consumer_id appears once, the payment_complete field equals '1.0', and the updated_at field is no more than 24 hours before the time now.
Here is my code (not sure how to retrieve the columns 'first_name', 'last_name' and 'email_addr' from df_birthday if conditions 1-3 are true):
def first_purchase():
    if df_orders.groupby("consumer_id").filter(lambda x: len(x) == 1):
        return "consumer_id"
    elif df_orders.loc[df_orders['payment_complete'] == 1.0]:
        return 'payment_complete'
Should I write another function to compare the results? I am not even sure whether this needs to be in a function or a for loop.
Here is the for loop I have been tinkering with (it's not right):
for first_purchase in df_orders:
    if df_orders.groupby("consumer_id").filter(lambda x: len(x) == 1):
    elif df_orders.loc[df_orders['payment_complete'] == 1.0]:
    else print 'fail'
Thank you in advance
Edit:
Sample Input:
df_birthday:
first_name last_name email_addr id
0 a A a#A 1
1 b B b#B 2
2 c B c#C 3
df_orders:
consumer_id payment_complete updated_at
0 1 1.0 2018-01-28
1 1 1.0 2018-01-28
2 2 1.0 2018-01-28
3 3 0 2018-01-28
Sample Output:
first_name last_name email_addr
0 b B b#B
You could filter a temporary copy of the orders dataframe first and use that to filter the birthday dataframe, so that only the records we want to join remain. Then we can join the filtered birthday dataframe back onto the orders dataframe.
Working example below, hope this helps!
import pandas as pd
import numpy as np
import datetime as dt
df_birthday = pd.DataFrame([['a', 'A','a#A',1],
['b', 'B','b#B',2],
['c', 'B','c#C',3]],
columns=["first_name", "last_name",'email_addr','id'])
df_orders = pd.DataFrame([[1, 1.0],
[1, 1.0],
[2, 1.0],
[3, 0.0]],
columns=["consumer_id", "payment_complete"])
df_orders['updated_at'] = pd.to_datetime('today') + dt.timedelta(hours=1)
# Filters:
# Only occurs once
# Has payment complete == 1
# datetime difference with timestamp less than 24 hours.
df_temp = df_orders.groupby("consumer_id").filter(lambda x: len(x) == 1)
df_temp = df_temp[np.isclose(df_temp.payment_complete,1)]
df_temp = df_temp[(dt.datetime.now() - df_temp['updated_at']) < dt.timedelta(hours=24)]
# Filter the df_birthday dataframe, and join on our df_orders
df_birthday2 = df_birthday[df_birthday.id.isin(df_temp.consumer_id)]
print(df_birthday2)
# Only necessary if you want to join
df_orders = pd.merge(df_orders, df_birthday2, how='left', left_on='consumer_id', right_on='id')
df_orders = df_orders.drop('id',axis=1)
print(df_orders)
df_birthday:
first_name last_name email_addr id
0 a A a#A 1
1 b B b#B 2
2 c B c#C 3
df_orders:
consumer_id payment_complete updated_at
0 1 1 2018-01-28 01:00:00
1 1 1 2018-01-28 01:00:00
2 2 1 2018-01-28 01:00:00
3 3 0 2018-01-28 01:00:00
Resulting df_birthday2:
first_name last_name email_addr id
1 b B b#B 2
Resulting df_orders (if you run the last three lines):
consumer_id payment_complete updated_at first_name last_name email_addr
0 1 1 2018-01-28 01:00:00 NaN NaN NaN
1 1 1 2018-01-28 01:00:00 NaN NaN NaN
2 2 1 2018-01-28 01:00:00 b B b#B
3 3 0 2018-01-28 01:00:00 NaN NaN NaN
I have a dataframe with multiple status fields per row. I want to check if any of the status fields have values in a list, and if so, I need to take the lowest date field for the corresponding status. My list of acceptable values and a sample dataframe look like this:
import datetime
import numpy as np
import pandas as pd

checkList = ['Foo','Bar']
df = pd.DataFrame([['A', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   ['B', 'Foo', datetime.datetime(2017,10,1), 'Other', datetime.datetime(2017,9,1), np.nan, np.nan],
                   ['C', 'Bar', datetime.datetime(2016,1,1), np.nan, np.nan, 'Foo', datetime.datetime(2016,5,5)]],
                  columns=['record','status1','status1_date','status2','status2_date','another_status','another_status_date'])
print(df)
record status1 status1_date status2 status2_date another_status \
0 A NaN NaT NaN NaT NaN
1 B Foo 2017-10-01 Other 2017-09-01 NaN
2 C Bar 2016-01-01 NaN NaT Foo
another_status_date
0 NaT
1 NaT
2 2016-05-05
I need to figure out if any of the statuses are in the approved list. If so, I need the first date for an approved status. The output would look like this:
print(output_df)
record master_status master_status_date
0 A False NaT
1 B True 2017-10-01
2 C True 2016-01-01
Any thoughts on how best to approach this? I can't just take the min date; I need the min only where the corresponding status field is in the list.
master_status = df.apply(lambda x: False if all([pd.isnull(rec) for rec in x[1:]]) else True, axis=1)
master_status_date = df.apply(lambda x: min([i for i in x[1:] if isinstance(i, datetime.datetime)]), axis=1)
record = df['record']
n_df = pd.concat([record, master_status, master_status_date], axis=1)
print(n_df)
record 0 1
0 A False NaT
1 B True 2017-09-01
2 C True 2016-01-01
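Note that the code above takes the minimum over all date columns and flags any non-null row, without consulting checkList. A small sketch that only considers dates whose paired status column is in checkList (the status/date column pairing below is an assumption taken from the sample layout) could look like this:
import pandas as pd

# hypothetical status/date column pairing, taken from the sample data above
pairs = [('status1', 'status1_date'),
         ('status2', 'status2_date'),
         ('another_status', 'another_status_date')]

# collect the dates whose paired status is approved, keyed by the original row index
approved_dates = pd.concat([df.loc[df[s].isin(checkList), d] for s, d in pairs])

output_df = df[['record']].copy()
output_df['master_status'] = output_df.index.isin(approved_dates.index)
output_df['master_status_date'] = pd.to_datetime(approved_dates).groupby(level=0).min()
print(output_df)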
I have a pandas dataframe something like this
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is to add a column which shows the number of entries for each ID that occurred within the last year and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations but there must be a better way!
I think you need between with boolean indexing to filter first, and then groupby and aggregate size.
The outputs are concatenated, with reindex added to fill missing rows with 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
#convert to datetime (if first number is day, add parameter dayfirst)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
now = pd.Timestamp.today()
print (now)
oneyearbeforenow = now - pd.offsets.DateOffset(years=1)
oneyearafternow = now + pd.offsets.DateOffset(years=1)
#first filter
a = df[df['Date'].between(oneyearbeforenow, now)].groupby('ID').size()
b = df[df['Date'].between(now, oneyearafternow)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
EDIT:
If you need to compare each date against each group's last date, shifted by a one-year offset, use a custom function with a condition and sum the True values:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(),
                         (x < x.iat[-1] + offs).sum()], index=['last','next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
In [19]: x['date'] = pd.to_datetime( x['date']) # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
Using resample takes care of the missing in-between years, see year 2015 below:
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only year in columns
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1