I want to group nearby dates together, using a rolling window (?) of three-week periods.
See example and attempt below:
import pandas as pd
d = {'id':[1, 1, 1, 1, 2, 3],
'datefield':['2021-01-01', '2021-01-15', '2021-01-30', '2021-02-05', '2020-02-10', '2020-02-20']}
df = pd.DataFrame(data=d)
df['datefield'] = pd.to_datetime(df['datefield'])
#   id  datefield
#0   1 2021-01-01
#1   1 2021-01-15
#2   1 2021-01-30
#3   1 2021-02-05
#4   2 2020-02-10
#5   3 2020-02-20
df['event'] = df.groupby(['id', pd.Grouper(key='datefield', freq='3W')]).ngroup()
# id datefield event
#0 1 2021-01-01 0
#1 1 2021-01-15 0
#2 1 2021-01-30 1 #Should be 0, since last id 1 event happened just 2 weeks ago
#3 1 2021-02-05 1 #Should be 0
#4 2 2020-02-10 2
#5 3 2020-02-20 3 #Correct, within 3 weeks of another but since the ids are not the same the event is different
You can compute a few helper columns to make it easy to understand:
df
id datefield
0 1 2021-01-01
1 1 2021-01-15
2 1 2021-01-30
3 1 2021-02-05
4 2 2020-02-10
5 2 2020-03-20
Calculate difference between dates in number of days
df['diff'] = df['datefield'].diff().dt.days
Get previous ID
df['prevId'] = df['id'].shift()
Decide whether to increment or not
import numpy as np
df['increment'] = np.where((df['diff'] > 21) | (df['prevId'] != df['id']), 1, 0)
Lastly, just get the cumulative sum
df['event'] = df['increment'].cumsum()
Output
id datefield diff prevId increment event
0 1 2021-01-01 NaN NaN 1 1
1 1 2021-01-15 14.0 1.0 0 1
2 1 2021-01-30 15.0 1.0 0 1
3 1 2021-02-05 6.0 1.0 0 1
4 2 2020-02-10 -361.0 1.0 1 2
5 2 2020-03-20 39.0 2.0 1 3
Let's try a different approach using a boolean series instead:
df['group'] = ((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))) |
(df['id'].ne(df['id'].shift()))).cumsum()
Output:
id datefield group
0 1 2021-01-01 1
1 1 2021-01-15 1
2 1 2021-01-30 1
3 1 2021-02-05 1
4 2 2020-02-10 2
5 2 2020-03-20 3
Is the difference between the previous row greater than 3 weeks:
print((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))))
0 False
1 False
2 False
3 False
4 False
5 True
Name: datefield, dtype: bool
Or is the current id not equal to the previous id:
print((df['id'].ne(df['id'].shift())))
0 True
1 False
2 False
3 False
4 True
5 False
Name: id, dtype: bool
or (|) together the conditions
print((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))) |
(df['id'].ne(df['id'].shift())))
0 True
1 False
2 False
3 False
4 True
5 True
dtype: bool
Then use cumsum to increment everywhere there is a True value to delimit the groups.
*Assumes the id and datefield columns are appropriately ordered.
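If the frame is not already ordered that way, a quick sort beforehand takes care of it (a minimal sketch, assuming you want rows ordered by id and then by date so the diff/shift comparisons see the intended neighbours):
# order rows by id, then by date, so diff()/shift() compare consecutive rows within each id
df = df.sort_values(['id', 'datefield']).reset_index(drop=True)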
It looks like you want the diff between consecutive rows to be three weeks or less, otherwise a new group is formed. You can do it like this, starting from initial time t0:
df = df.sort_values("datefield").reset_index(drop=True)
t0 = df.datefield.iloc[0]
df["delta_t"] = pd.TimedeltaIndex(df.datefield - t0)
df["group"] = (df.delta_t.dt.days.diff() > 21).cumsum()
output:
id datefield delta_t group
0 2 2020-02-10 0 days 0
1 2 2020-03-20 39 days 1
2 1 2021-01-01 326 days 2
3 1 2021-01-15 340 days 2
4 1 2021-01-30 355 days 2
5 1 2021-02-05 361 days 2
Note that your original dataframe is not sorted properly.
I have a data frame
Count ID Date
1 1 2020-07-09
2 1 2020-07-11
1 1 2020-07-21
1 2 2020-07-04
2 2 2020-07-09
3 2 2020-07-18
1 3 2020-07-02
2 3 2020-07-05
1 3 2020-07-19
2 3 2020-07-22
For each row, I want to subtract the previous date that has the same Count (within each ID group) from the current date. Rows without an earlier row with the same Count get a value of zero.
Expected output
ID Date Days
1 2020-07-09 0
1 2020-07-11 0
1 2020-07-21 12 (2020-07-21 MINUS 2020-07-09)
2 2020-07-04 0
2 2020-07-09 0
2 2020-07-18 0
3 2020-07-02 0
3 2020-07-05 0
3 2020-07-19 17 (2020-07-19 MINUS 2020-07-02)
3 2020-07-22 17 (2020-07-22 MINUS 2020-07-05)
My initial thought was to filter out the Count-ID pairs and then do the calculation. I was wondering if there is a better way around this?
You can use groupby() to group by the columns ID and Count, take the difference in days with .diff(), and fill the NaN values with 0 using .fillna(), as follows:
df['Date'] = pd.to_datetime(df['Date']) # convert to datetime if not already in datetime format
df['Days'] = df.groupby(['ID', 'Count'])['Date'].diff().dt.days.fillna(0, downcast='infer')
Result:
print(df)
Count ID Date Days
0 1 1 2020-07-09 0
1 2 1 2020-07-11 0
2 1 1 2020-07-21 12
3 1 2 2020-07-04 0
4 2 2 2020-07-09 0
5 3 2 2020-07-18 0
6 1 3 2020-07-02 0
7 2 3 2020-07-05 0
8 1 3 2020-07-19 17
9 2 3 2020-07-22 17
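Note that .diff() compares rows in the order they appear, so this assumes the dates are already ascending within each (ID, Count) group. If that is not guaranteed, a sort up front keeps the differences meaningful (a small sketch, not part of the answer above):
# make sure dates are ascending within each ID/Count group before taking diffs
df = df.sort_values(['ID', 'Count', 'Date']).reset_index(drop=True)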
I like SeaBean's answer, but here is what I was working on before I saw it:
df2 = df.sort_values(by=['ID', 'Count'])
df2['Date'] = pd.to_datetime(df2['Date'])
# previous date within each (ID, Count) group
df2['shift1'] = df2.groupby(['ID', 'Count'])['Date'].shift(1)
# combine_first fills the NaT from shift with the row's own date,
# so the first row of each group gets a diff of 0 instead of NaN
df2['diff'] = (df2.Date - df2.shift1.combine_first(df2.Date)).dt.days
I want to get, for each date, the sum of the values in a column over the next 7 days.
My dataframe:
date value
0 2021-04-29 1
1 2021-05-03 2
2 2021-05-06 1
3 2021-05-15 1
4 2021-05-17 2
5 2021-05-18 1
6 2021-05-21 2
7 2021-05-22 5
8 2021-05-24 4
I tried to make a new column that contains the date 7 days from the current date:
from datetime import timedelta
df['temp'] = df['date'] + timedelta(days=7)
then calculate the sum of values within that date range:
df['next_7days'] = df[(df.date > df.date) & (df.date <= df.temp)].value.sum()
But this gives me all 0s as the answer.
Intended result:
date value next_7days
0 2021-04-29 1 3
1 2021-05-03 2 1
2 2021-05-06 1 0
3 2021-05-15 1 10
4 2021-05-17 2 12
5 2021-05-18 1 11
6 2021-05-21 2 9
7 2021-05-22 5 4
8 2021-05-24 4 0
The method I am currently using is quite tedious; are there any better methods to get the intended result?
With a list comprehension:
tomorrow_dates = df.date + pd.Timedelta("1 day")
next_week_dates = df.date + pd.Timedelta("7 days")
df["next_7days"] = [df.value[df.date.between(tomorrow, next_week)].sum()
for tomorrow, next_week in zip(tomorrow_dates, next_week_dates)]
Here we first compute tomorrow's and next week's dates and store them. We then zip them together and use Series.between to get a boolean mask indicating whether each date falls within the desired range, use boolean indexing to select the corresponding values, and sum them. This is done for each date pair.
to get
date value next_7days
0 2021-04-29 1 3
1 2021-05-03 2 1
2 2021-05-06 1 0
3 2021-05-15 1 10
4 2021-05-17 2 12
5 2021-05-18 1 11
6 2021-05-21 2 9
7 2021-05-22 5 4
8 2021-05-24 4 0
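For larger frames, where the pairwise scan in the list comprehension becomes expensive, a vectorized sketch using a cumulative sum plus numpy.searchsorted could look like the following (my own suggestion, assuming df is sorted by date and that date is a datetime column):
import numpy as np
# sums 'value' over the window (date, date + 7 days] for every row
dates = df['date'].to_numpy()
csum = np.concatenate(([0], df['value'].cumsum().to_numpy()))
start = np.searchsorted(dates, dates, side='right')                          # first row strictly after each date
stop = np.searchsorted(dates, dates + np.timedelta64(7, 'D'), side='right')  # first row after date + 7 days
df['next_7days'] = csum[stop] - csum[start]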
I need to select the rows with the last value for each user_id and date, but when the last value in the metric column is 'leave', select the last 2 rows (if they exist).
My data:
df = pd.DataFrame({
"user_id": [1,1,1, 2,2,2]
,'subscription': [1,1,2,3,4,5]
,"metric": ['enter', 'stay', 'leave', 'enter', 'leave', 'enter']
,'date': ['2020-01-01', '2020-01-01', '2020-03-01', '2020-01-01', '2020-01-01', '2020-01-02']
})
#result
user_id subscription metric date
0 1 1 enter 2020-01-01
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
Expected output:
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01 # kept because the group's last metric is 'leave' within (user_id, date)
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
What I've tried: drop_duplicates and groupby; both give the same result, with only the last value:
df.drop_duplicates(['user_id', 'date'], keep='last')
#or
df.groupby(['user_id', 'date']).tail(1)
You can use boolean masking: build three conditions that are True or False, stored in the variables a, b, and c, then filter for the rows where any of a, b, or c is True using the or operator |:
a = df.groupby(['user_id', 'date', df.groupby(['user_id', 'date']).cumcount()])['metric'].transform('last') == 'leave'
b = df.groupby(['user_id', 'date'])['metric'].transform('count') == 1
c = a.shift(-1) & (b == False)
df = df[a | b | c]
print(a, b, c)
df
#a groups by the two required columns plus the cumulative count within each group, which is necessary in order to return True for the last "metric" within each group.
0 False
1 False
2 True
3 False
4 True
5 False
Name: metric, dtype: bool
#b if something has a count of one, then you want to keep it.
0 False
1 False
2 True
3 False
4 False
5 True
Name: metric, dtype: bool
#c simply uses .shift(-1) to flag the row immediately before a 'leave' row; for the condition to be satisfied the count for that group must be > 1.
0 False
1 True
2 False
3 True
4 False
5 False
Name: metric, dtype: bool
Out[18]:
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
This is one way, but in my opinion it is slow, since we are iterating through the groups:
df["date"] = pd.to_datetime(df["date"])
df = df.assign(metric_is_leave=df.metric.eq("leave"))
pd.concat(
[
value.iloc[-2:, :-1] if value.metric_is_leave.any() else value.iloc[-1:, :-1]
for key, value in df.groupby(["user_id", "date"])
]
)
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
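A fully vectorized variant that avoids looping over the groups is also possible. This is only a sketch of the idea (it assumes rows are already ordered within each (user_id, date) group, as in the example): keep the last row of each group, plus the second-to-last row whenever the group's last metric is 'leave'.
g = df.groupby(['user_id', 'date'])
pos_from_end = g.cumcount(ascending=False)                 # 0 = last row of its group
last_is_leave = g['metric'].transform('last').eq('leave')  # group's last metric is 'leave'
result = df[(pos_from_end == 0) | ((pos_from_end == 1) & last_is_leave)]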
I am working on a data set with the following columns:
order_id
order_item_id
product mrp
units
sale_date
I want to create a new column which shows how much the mrp changed since the last time this product was sold. Is there a way I can do this with a pandas data frame?
Sorry if this question is very basic, but I am pretty new to pandas.
For each row of the data, I want to check the amount of the price change since the last time the product was sold.
You can do this as follows:
# define a function that applies a rolling window calculation,
# taking the difference between the current value and the
# previous value
def calc_mrp(ser):
    # in case you want the relative change, just
    # divide by x[1] or x[0] in the lambda function
    return ser.rolling(window=2).apply(lambda x: x[1] - x[0])
# apply this to the grouped 'product_mrp' column
# and store the result in a new column
df['mrp_change'] = df.groupby('product_id')['product_mrp'].apply(calc_mrp)
If this is executed on a dataframe like:
Out[398]:
order_id product_id product_mrp units_sold sale_date
0 0 2 647.169280 8 2019-08-23
1 1 0 500.641188 0 2019-08-24
2 2 1 647.789399 15 2019-08-25
3 3 0 381.278167 12 2019-08-26
4 4 2 373.685000 7 2019-08-27
5 5 4 553.472850 2 2019-08-28
6 6 4 634.482718 7 2019-08-29
7 7 3 536.760482 11 2019-08-30
8 8 0 690.242274 6 2019-08-31
9 9 4 500.515521 0 2019-09-01
It yields:
Out[400]:
order_id product_id product_mrp units_sold sale_date mrp_change
0 0 2 647.169280 8 2019-08-23 NaN
1 1 0 500.641188 0 2019-08-24 NaN
2 2 1 647.789399 15 2019-08-25 NaN
3 3 0 381.278167 12 2019-08-26 -119.363022
4 4 2 373.685000 7 2019-08-27 -273.484280
5 5 4 553.472850 2 2019-08-28 NaN
6 6 4 634.482718 7 2019-08-29 81.009868
7 7 3 536.760482 11 2019-08-30 NaN
8 8 0 690.242274 6 2019-08-31 308.964107
9 9 4 500.515521 0 2019-09-01 -133.967197
The NaNs are in the rows for which there is no previous order with the same product_id.
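If only the absolute change is needed, the same result can be obtained more directly with groupby().diff(); this is a one-line sketch of an equivalent, producing the same NaN pattern for each product's first order:
# per-product difference between consecutive prices
df['mrp_change'] = df.groupby('product_id')['product_mrp'].diff()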
I have a pandas data frame:
df12 = pd.DataFrame({'group_ids':[1,1,1,2,2,2],'dates':['2016-04-01','2016-04-20','2016-04-28','2016-04-05','2016-04-20','2016-04-29'],'event_today_in_group':[1,0,1,1,1,0]})
group_ids dates event_today_in_group
0 1 2016-04-01 1
1 1 2016-04-20 0
2 1 2016-04-28 1
3 2 2016-04-05 1
4 2 2016-04-20 1
5 2 2016-04-29 0
I would like to compute an additional column that contains, for each group_ids, the number of days since the last time event_today_in_group was 1.
group_ids dates event_today_in_group days_since_last_event
0 1 2016-04-01 1 0
1 1 2016-04-20 0 19
2 1 2016-04-28 1 27
3 2 2016-04-05 1 0
4 2 2016-04-20 1 15
5 2 2016-04-29 0 9
As I mentioned earlier, this will get you the non-cumulative difference between dates within each group (after converting the dates column to datetime):
df['dates'] = pd.to_datetime(df['dates'])
df['days_since_last_event'] = df.groupby('group_ids')['dates'].diff().apply(lambda x: x.days)
In order to get a cumulative sum of this difference, based on whenever event_today_in_group changes, I propose using shift to get the value of the previous row, and then generating a cumulative sum, like so:
df['event_today_in_group'].shift().cumsum()
Output:
0 NaN
1 1.0
2 1.0
3 2.0
4 3.0
5 4.0
This gives us the second grouping value we need to get the cumulative sums. You could assign the above values to a new column, but if you're only using them for the calculation, then you can simply include them in the subsequent groupby operation like so:
df.loc[:, 'days_since_last_event'] = df.groupby(['group_ids', df['event_today_in_group'].shift().cumsum()])['days_since_last_event'].cumsum()
Result:
group_ids dates event_today_in_group days_since_last_event
0 1 2016-04-01 1 NaN
1 1 2016-04-20 0 19.0
2 1 2016-04-28 1 27.0
3 2 2016-04-05 1 NaN
4 2 2016-04-20 1 15.0
5 2 2016-04-29 0 9.0
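If those remaining NaN rows should show 0, as in the desired output above, they can simply be filled in afterwards (a small follow-up sketch):
# show 0 instead of NaN for the first event in each group
df['days_since_last_event'] = df['days_since_last_event'].fillna(0)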