Randomly sample from panel data by 3months periods

Randomly sample from panel data by 3months periods - python

I have pandas data frame that is panel data i.e. data of multiple customers over a timeframe. I want to sample (for bootstraping) a continuous three months period (i always wanna get a full month) of a random customer 90 times.
I have googled a bit and found several sampling techniques but none that would include sampling based on three continuous months.
I was considering just making a list of all the month names and sampling three consecutive ones (although not sure how to do consecutive). But how would i then be able to e.g. pick Nov21-Dec21-Jan22 ?
Would appreciate the help a lot!

import pandas as pd
date_range = pd.date_range("2020-01-01", "2022-01-01")
df = pd.DataFrame({"value":3}, index=date_range)
df.groupby(df.index.quarter).sample(5)
This would output:
Out[12]:
value
2021-01-14 3
2021-02-27 3
2020-01-20 3
2021-02-03 3
2021-02-19 3
2021-04-27 3
2021-06-29 3
2021-04-12 3
2020-06-24 3
2020-06-05 3
2021-07-30 3
2020-08-29 3
2021-07-03 3
2020-07-17 3
2020-09-12 3
2020-12-22 3
2021-12-13 3
2021-11-29 3
2021-12-19 3
2020-10-18 3
It selected 5 sample values form each quarter group.
From now own you can format date column (index) for it to write month in text.

Related

Python dataframe find closest date for each ID

I have a dataframe like this:
data = {'SalePrice':[10,10,10,20,20,3,3,1,4,8,8],'HandoverDateA':['2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-04-30','2022-03-30','2022-03-30'],'ID': ['Tom', 'Tom','Tom','Joseph','Joseph','Ben','Ben','Eden','Tim','Adam','Adam'], 'Tranche': ['Red', 'Red', 'Red', 'Red','Red','Blue','Blue','Red','Red','Red','Red'],'Totals':[100,100,100,50,50,90,90,70,60,70,70],'Sent':['2022-01-18','2022-02-19','2022-03-14','2022-03-14','2022-04-22','2022-03-03','2022-02-07','2022-01-04','2022-01-10','2022-01-15','2022-03-12'],'Amount':[20,10,14,34,15,60,25,10,10,40,20],'Opened':['2021-12-29','2021-12-29','2021-12-29','2022-12-29','2022-12-29','2021-12-19','2021-12-19','2021-12-29','2021-12-29','2021-12-29','2021-12-29']}
I need to find the sent date which is closest to the HandoverDate. I've seen plenty of examples that work when you give one date to search but here the date I want to be closest to can change for every ID. I have tried to adapt the following:
def nearest(items, pivot):
return min([i for i in items if i <= pivot], key=lambda x: abs(x - pivot))
And also tried to write a loop where I make a dataframe for each ID and use max on the date column then stick them together, but it's incredibly slow!
Thanks for any suggestions :)

IIUC, you can use:
data[['HandoverDateA', 'Sent']] = data[['HandoverDateA', 'Sent']].apply(pd.to_datetime)
out = data.loc[data['HandoverDateA']
.sub(data['Sent']).abs()
.groupby(data['ID']).idxmin()]
Output:
SalePrice HandoverDateA ID Tranche Totals Sent Amount Opened
10 8 2022-03-30 Adam Red 70 2022-03-12 20 2021-12-29
5 3 2022-04-30 Ben Blue 90 2022-03-03 60 2021-12-19
7 1 2022-04-30 Eden Red 70 2022-01-04 10 2021-12-29
4 20 2022-04-30 Joseph Red 50 2022-04-22 15 2022-12-29
8 4 2022-04-30 Tim Red 60 2022-01-10 10 2021-12-29
2 10 2022-04-30 Tom Red 100 2022-03-14 14 2021-12-29

Considering that the goal is to find the sent date which is closest to the HandoverDate, one approach would be as follows.
First of all, create the dataframe df from data with pandas.DataFrame
import pandas as pd
df = pd.DataFrame(data)
Then, make sure that the columns HandoverDateA and Sent are of datetime using pandas.to_datetime
df['HandoverDateA'] = pd.to_datetime(df['HandoverDateA'])
df['Sent'] = pd.to_datetime(df['Sent'])
Then, in order to make it more convenient, create a column, diff, to store the absolute value of the difference between the columns HandoverDateA and Sent
df['diff'] = (df['HandoverDateA'] - df['Sent']).dt.days.abs()
With that column, one can simply sort by that column as follows
df = df.sort_values(by=['diff'])
[Out]:
SalePrice HandoverDateA ID ... Amount Opened diff
4 20 2022-04-30 Joseph ... 15 2022-12-29 8
10 8 2022-03-30 Adam ... 20 2021-12-29 18
2 10 2022-04-30 Tom ... 14 2021-12-29 47
5 3 2022-04-30 Ben ... 60 2021-12-19 58
8 4 2022-04-30 Tim ... 10 2021-12-29 110
7 1 2022-04-30 Eden ... 10 2021-12-29 116
and the first row is the one where Sent is closest to HandOverDateA.
With the column diff, one option to get the one where diff is minimum is with pandas.DataFrame.query as follows
df = df.query('diff == diff.min()')
[Out]:
SalePrice HandoverDateA ID Tranche ... Sent Amount Opened diff
4 20 2022-04-30 Joseph Red ... 2022-04-22 15 2022-12-29 8
Notes:
For more information on sorting dataframes by columns, read my answer here.

How to get weekly averages for column values and week number for the corresponding year based on daily data records with pandas

I'm still learning python and would like to ask your help with the following problem:
I have a csv file with daily data and I'm looking for a solution to sum it per calendar weeks. So for the mockup data below I have rows stretched over 2 weeks (week 14 (current week) and week 13 (past week)). Now I need to find a way to group rows per calendar week, recognize what year they belong to and calculate week sum and week average. In the file input example there are only two different IDs. However, in the actual data file I expect many more.
input.csv
id date activeMembers
1 2020-03-30 10
2 2020-03-30 1
1 2020-03-29 5
2 2020-03-29 6
1 2020-03-28 0
2 2020-03-28 15
1 2020-03-27 32
2 2020-03-27 10
1 2020-03-26 9
2 2020-03-26 3
1 2020-03-25 0
2 2020-03-25 0
1 2020-03-24 0
2 2020-03-24 65
1 2020-03-23 22
2 2020-03-23 12
...
desired output.csv
id week WeeklyActiveMembersSum WeeklyAverageActiveMembers
1 202014 10 1.4
2 202014 1 0.1
1 202013 68 9.7
2 202013 111 15.9
my goal is to:
import pandas as pd
df = pd.read_csv('path/to/my/input.csv')
Here I'd need to group by 'id' + 'date' column (per calendar week - not sure if this is possible) and create a 'week' column with the week number, then sum 'activeMembers' values for the particular week, save as 'WeeklyActiveMembersSum' column in my output file and finally calculate 'weeklyAverageActiveMembers' for the particular week. I was experimenting with groupby and isin parameters but no luck so far... would I have to go with something similar to this:
df.groupby('id', as_index=False).agg({'date':'max',
'activeMembers':'sum'}
and finally save all as output.csv:
df.to_csv('path/to/my/output.csv', index=False)
Thanks in advance!

It seems I'm getting a different week setting than you do:
# should convert datetime column to datetime type
df['date'] = pd.to_datetime(df['date'])
(df.groupby(['id',df.date.dt.strftime('%Y%W')], sort=False)
.activeMembers.agg([('Sum','sum'),('Average','mean')])
.add_prefix('activeMembers')
.reset_index()
)
Output:
id date activeMembersSum activeMembersAverage
0 1 202013 10 10.000000
1 2 202013 1 1.000000
2 1 202012 68 9.714286
3 2 202012 111 15.857143

Filtering and applying arthimetic expression in pandas data frame column

I have a column named volume in a pandas data frame and I wanted to look back previous 5 volumes from current column # and find 40 percentile .
Volume data - as follows
1200
3400
5000
2300
4502
3420
5670
5400
4320
7890
8790
For 1st 5 values we don’t have enough data to look back , but from 6th value 3420 we should find percentile (40) of previous 5 volumes 1200,3400,5000,2300,4502 and keep doing this for rest of the data by taking previous 5 data from current value.

Not sure if I understand correctly since there is no mcve
However, sounds like you want a rolling quantile
>>> s.rolling(5).quantile(0.4)
0 NaN
1 NaN
2 NaN
3 NaN
4 2960.0
5 3412.0
6 4069.2
7 4069.2
8 4429.2
9 4968.0
10 5562.0
dtype: float64

Rolling sum based on dates, adding in conditions that actively update values in Pandas Dataframe if met?

I am calculating rolling last 180 day sales totals by ID in Python using Pandas and need to be able to update the last 180 day cumulative sales column if a user hits a certain threshold. For example, if someone reaches $100 spent cumulatively in the last 180 days, their cumulative spend for that day should reflect them reaching that level and effectively "redeeming" that $100, leaving them only with the excess from the last visit as progress towards their next $100 hit. (See the example below)
I also need to create a separate data frame during this process containing only the dates & user_ids for when the $100 is met to keep track of how many times the threshold has been met across all users.
I was thinking somehow I could use apply with conditional statements, but was not sure exactly how it would work as the data frame needs to be updated on the fly to have the rolling sums for later dates be calculated taking into account this updated total. In other words, the cumulative sums for dates after they hit the threshold need to be adjusted for the fact that they "redeemed" the $100.
This is what I have so far that gets the rolling cumulative sum by user. I don't know if its possible to chain conditional methods with apply to this or what the best way forward is.
order_data['rolling_sales_180'] = order_data.groupby('user_id').rolling(window='180D', on='day')['sales'].sum().reset_index(drop=True)
See the below example of expected results. In row 6, the user reaches $120, crossing the $100 threshold, but the $100 is subtracted from his cumulative sum as of that date and he is left with $20 as of that date because that was the amount in excess of the $100 threshold that he spent on that day. He then continues to earn cumulatively on this $20 for his subsequent visit within 180 days. A user can go through this process many times, earning many rewards over different 180 day periods.
print(order_data)
day user_id sales \
0 2017-08-10 1 10
1 2017-08-22 1 10
2 2017-08-31 1 10
3 2017-09-06 1 10
4 2017-09-19 1 10
5 2017-10-16 1 30
6 2017-11-28 1 40
7 2018-01-22 1 10
8 2018-03-19 1 10
9 2018-07-25 1 10
rolling_sales_180
0 10
1 20
2 30
3 40
4 50
5 80
6 20
7 30
8 40
9 20
Additionally, as mentioned above, I need a separate data frame to be created throughout this process with the day, user_id, sales, and rolling_sales_180 that only includes all the days during which the $100 threshold was met in order to count the number of times this goal is reached. See below:
print(threshold_reached)
day user_id sales rolling_sales_180
0 2017-11-28 1 40 120
.
.
.

If I understand your question correctly, the following should work for you:
def groupby_rolling(grp_df):
df = grp_df.set_index("day")
cum_sales = df.rolling("180D")["sales"].sum()
hundreds = (cum_sales // 100).astype(int)
progress = cum_sales % 100
df["rolling_sales_180"] = cum_sales
df["progress"] = progress
df["milestones"] = hundreds
return df
result = df.groupby("user_id").apply(groupby_rolling)
Output of this is (for your provided sample):
user_id sales rolling_sales_180 progress milestones
user_id day
1 2017-08-10 1 10 10.0 10.0 0
2017-08-22 1 10 20.0 20.0 0
2017-08-31 1 10 30.0 30.0 0
2017-09-06 1 10 40.0 40.0 0
2017-09-19 1 10 50.0 50.0 0
2017-10-16 1 30 80.0 80.0 0
2017-11-28 1 40 120.0 20.0 1
2018-01-22 1 10 130.0 30.0 1
2018-03-19 1 10 90.0 90.0 0
2018-07-25 1 10 20.0 20.0 0
What the groupby(...).apply(...) does is for each group in the original df, the provided function is applied. In this case, I've encapsulated your complex logic, which is currently not possible to do with a straightforward groupby-rolling operation, in a simple-to-parse basic function.
The function should hopefully be self-documenting by how I named variables, but I'd be happy to add comments if you'd like.

How to group my time by month / week in pd.DataFrame

I have this DataFrame about my Facebook that says, the events I interested at, I joined and the respective time frame for them. I am having some problem of grouping the time by month or week, because there are two of them
joined_time interested_time
0 2019-04-01 2019-04-21
1 2019-03-15 2019-04-06
2 2019-03-13 2019-03-26
Both time indicates when I clicked the 'Going' or 'Interested' button when an event pops up in Facebook. Sorry for the very small sample size, but this is what I have simplified it down to at the moment. And what I am trying to achieve here is that,
Year Month Total_Events_No Events_Joined Events_Interested
2019 3 3 2 1
4 3 1 2
Where in this DataFrame, the year and month are multi-index, and the other columns consist of the counts of respective situations.

I am using melt before groupby and unstack
s=df.melt()
s.value=pd.to_datetime(s.value)
s=s.groupby([s.value.dt.year,s.value.dt.month,s.variable]).size().unstack()
s['Total']=s.sum(axis=1)
s
variable interested_time joined_time Total
value value
2019 3 1 2 3
4 2 1 3

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.