Substract previous row from preceding row by group WITH condition - python

I have a data frame
Count ID Date
1 1 2020-07-09
2 1 2020-07-11
1 1 2020-07-21
1 2 2020-07-04
2 2 2020-07-09
3 2 2020-07-18
1 3 2020-07-02
2 3 2020-07-05
1 3 2020-07-19
2 3 2020-07-22
I want to subtract the row in the date column by the row above it that has the same count BY each ID group. Those without the same count get a value of zero
Excepted output
ID Date Days
1 2020-07-09 0
1 2020-07-11 0
1 2020-07-21 12 (2020-07-21 MINUS 2020-07-09)
2 2020-07-04 0
2 2020-07-09 0
2 2020-07-18 0
3 2020-07-02 0
3 2020-07-05 0
3 2020-07-19 17 (2020-07-19 MINUS 2020-07-02)
3 2020-07-22 17 (2020-07-22 MINUS 2020-07-05)
My initial thought process is to filter out Count-ID pairs, and then do the calculation.. I was wondering if there is a better workaround this>

You can use groupby() to group by columns ID and Count, get the difference in days by .diff(). Fill NaN values with 0 by .fillna(), as follows:
df['Date'] = pd.to_datetime(df['Date']) # convert to datetime if not already in datetime format
df['Days'] = df.groupby(['ID', 'Count'])['Date'].diff().dt.days.fillna(0, downcast='infer')
Result:
print(df)
Count ID Date Days
0 1 1 2020-07-09 0
1 2 1 2020-07-11 0
2 1 1 2020-07-21 12
3 1 2 2020-07-04 0
4 2 2 2020-07-09 0
5 3 2 2020-07-18 0
6 1 3 2020-07-02 0
7 2 3 2020-07-05 0
8 1 3 2020-07-19 17
9 2 3 2020-07-22 17

I like SeaBean's answer, but here is what I was working on before I saw that
df2 = df.sort_values(by = ['ID', 'Count'])
df2['Date'] = pd.to_datetime(df2['Date'])
df2['shift1'] = df2.groupby(['ID', 'Count'])['Date'].shift(1)
df2['diff'] = (df2.Date- df2.shift1.combine_first(df2.Date) ).dt.days

Related

Python delete rows for each group after first occurance in a column

I Have a dataframe as follows:
df = pd.DataFrame({'Key':[1,1,1,1,2,2,2,4,4,4,5,5],
'Activity':['A','A','H','B','B','H','H','A','C','H','H','B'],
'Date':['2022-12-03','2022-12-04','2022-12-06','2022-12-08','2022-12-03','2022-12-06','2022-12-10','2022-12-03','2022-12-04','2022-12-07','2022-12-03','2022-12-13']})
I need to count the activities for each 'Key' that occur before 'Activity' == 'H' as follows:
Required Output
My Approach
Sort df by Key & Date ( Sample input is already sorted)
drop the rows that occur after 'H' Activity in each group as follows:
Groupby df.groupby(['Key', 'Activity']).count()
Is there a better approach , if not then help me in code for dropping the rows that occur after 'H' Activity in each group.
Thanks in advance !
You can bring the H dates "back" into each previous row to use in a comparison.
First mark each H date in a new column:
df.loc[df["Activity"] == "H" , "End"] = df["Date"]
Key Activity Date End
0 1 A 2022-12-03 NaT
1 1 A 2022-12-04 NaT
2 1 H 2022-12-06 2022-12-06
3 1 B 2022-12-08 NaT
4 2 B 2022-12-03 NaT
5 2 H 2022-12-06 2022-12-06
6 2 H 2022-12-10 2022-12-10
7 4 A 2022-12-03 NaT
8 4 C 2022-12-04 NaT
9 4 H 2022-12-07 2022-12-07
10 5 H 2022-12-03 2022-12-03
11 5 B 2022-12-13 NaT
Backward fill the new column for each group:
df["End"] = df.groupby("Key")["End"].bfill()
Key Activity Date End
0 1 A 2022-12-03 2022-12-06
1 1 A 2022-12-04 2022-12-06
2 1 H 2022-12-06 2022-12-06
3 1 B 2022-12-08 NaT
4 2 B 2022-12-03 2022-12-06
5 2 H 2022-12-06 2022-12-06
6 2 H 2022-12-10 2022-12-10
7 4 A 2022-12-03 2022-12-07
8 4 C 2022-12-04 2022-12-07
9 4 H 2022-12-07 2022-12-07
10 5 H 2022-12-03 2022-12-03
11 5 B 2022-12-13 NaT
You can then select rows with Date before End
df.loc[df["Date"] < df["End"]]
Key Activity Date End
0 1 A 2022-12-03 2022-12-06
1 1 A 2022-12-04 2022-12-06
4 2 B 2022-12-03 2022-12-06
7 4 A 2022-12-03 2022-12-07
8 4 C 2022-12-04 2022-12-07
To generate the final form - you can use .pivot_table()
(df.loc[df["Date"] < df["End"]]
.pivot_table(index="Key", columns="Activity", values="Date", aggfunc="count")
.reindex(df["Key"].unique()) # Add in keys with no match e.g. `5`
.fillna(0)
.astype(int))
Activity A B C
Key
1 2 0 0
2 0 1 0
4 1 0 1
5 0 0 0
Try this:
(df.loc[df['Activity'].eq('H').groupby(df['Key']).cumsum().eq(0)]
.set_index('Key')['Activity']
.str.get_dummies()
.groupby(level=0).sum()
.reindex(df['Key'].unique(),fill_value=0)
.reset_index())
Output:
Key A B C
0 1 2 0 0
1 2 0 1 0
2 4 1 0 1
3 5 0 0 0
You can try:
# sort by Key and Date
df.sort_values(['Key', 'Date'], inplace=True)
# this is to keep Key in the result when no values are kept after the filter
df.Key = df.Key.astype('category')
# filter all rows after the 1st H for each Key and then pivot
df[~df.Activity.eq('H').groupby(df.Key).cummax()].pivot_table(
index='Key', columns='Activity', aggfunc='size'
).reset_index()
#Activity Key A B C
#0 1 2 0 0
#1 2 0 1 0
#2 4 1 0 1
#3 5 0 0 0

Group nearby dates

I want to group nearby dates together, using a rolling window (?) of three week periods.
See example and attempt below:
import pandas as pd
d = {'id':[1, 1, 1, 1, 2, 3],
'datefield':['2021-01-01', '2021-01-15', '2021-01-30', '2021-02-05', '2020-02-10', '2020-02-20']}
df = pd.DataFrame(data=d)
df['datefield'] = pd.to_datetime(df['datefield'])
# id datefield
#0 1 2021-01-01
#1 1 2021-01-15
#2 1 2021-02-01
#3 2 2020-02-10
#4 3 2020-02-20
df['event'] = df.groupby(['id', pd.Grouper(key='datefield', freq='3W')]).ngroup()
# id datefield event
#0 1 2021-01-01 0
#1 1 2021-01-15 0
#2 1 2021-01-30 1 #Should be 0, since last id 1 event happened just 2 weeks ago
#3 1 2021-02-05 1 #Should be 0
#4 2 2020-02-10 2
#5 3 2020-02-20 3 #Correct, within 3 weeks of another but since the ids are not the same the event is different
Can compute different columns to make it easily understandable
df
id datefield
0 1 2021-01-01
1 1 2021-01-15
2 1 2021-01-30
3 1 2021-02-05
4 2 2020-02-10
5 2 2020-03-20
Calculate difference between dates in number of days
df['diff'] = df['datefield'].diff().dt.days
Get previous ID
df['prevId'] = df['id'].shift()
Decide whether to increment or not
df['increment'] = np.where((df['diff']>21) | (df['prevId'] != df['id']), 1, 0)
Lastly, just get the cumulative sum
df['event'] = df['increment'].cumsum()
Output
id datefield diff prevId increment event
0 1 2021-01-01 NaN NaN 1 1
1 1 2021-01-15 14.0 1.0 0 1
2 1 2021-01-30 15.0 1.0 0 1
3 1 2021-02-05 6.0 1.0 0 1
4 2 2020-02-10 -361.0 1.0 1 2
5 2 2020-03-20 39.0 2.0 1 3
Let's try a different approach using a boolean series instead:
df['group'] = ((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))) |
(df['id'].ne(df['id'].shift()))).cumsum()
Output:
id datefield group
0 1 2021-01-01 1
1 1 2021-01-15 1
2 1 2021-01-30 1
3 1 2021-02-05 1
4 2 2020-02-10 2
5 2 2020-03-20 3
Is the difference between the previous row greater than 3 weeks:
print((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))))
0 False
1 False
2 False
3 False
4 False
5 True
Name: datefield, dtype: bool
Or is the current id not equal to the previous id:
print((df['id'].ne(df['id'].shift())))
0 True
1 False
2 False
3 False
4 True
5 False
Name: id, dtype: bool
or (|) together the conditions
print((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))) |
(df['id'].ne(df['id'].shift())))
0 True
1 False
2 False
3 False
4 True
5 True
dtype: bool
Then use cumsum to increment every where there is a True value to delimit the groups.
*Assumes id and datafield columns are appropriately ordered.
It looks like you want the diff between consecutive rows to be three weeks or less, otherwise a new group is formed. You can do it like this, starting from initial time t0:
df = df.sort_values("datefield").reset_index(drop=True)
t0 = df.datefield.iloc[0]
df["delta_t"] = pd.TimedeltaIndex(df.datefield - t0)
df["group"] = (df.delta_t.dt.days.diff() > 21).cumsum()
output:
id datefield delta_t group
0 2 2020-02-10 0 days 0
1 2 2020-03-20 39 days 1
2 1 2021-01-01 326 days 2
3 1 2021-01-15 340 days 2
4 1 2021-01-30 355 days 2
5 1 2021-02-05 361 days 2
Note that your original dataframe is not sorted properly.

How to drop rows for each value in a column using a condition?

I have the following dataframe:
df = pd.DataFrame({'No': [0,0,0,1,1,2,2],
'date':['2020-01-15','2019-12-16','2021-03-01', '2018-05-19', '2016-04-08', '2020-01-02', '2020-03-07']})
df.date =pd.to_datetime(df.date)
No date
0 0 2018-01-15
1 0 2019-12-16
2 0 2021-03-01
3 1 2018-05-19
4 1 2016-04-08
5 2 2020-01-02
6 2 2020-03-07
I want to drop the rows if all the date values are earlier than 2020-01-01 for each unique number in No column, i.e. I want to drop rows with the indices 3 and 4.
Is it possible to do it without a for loop?
Use groupby and transform:
>>> df[df.groupby('No')['date'].transform('max')>='2020-01-01']
No date
0 0 2020-01-15
1 0 2019-12-16
2 0 2021-03-01
5 2 2020-01-02
6 2 2020-03-07

creating daily price change for a product on a pandas dataframe

I am working on a data set with the following columns:
order_id
order_item_id
product mrp
units
sale_date
I want to create a new column which shows how much the mrp changed from the last time this product was. This there a way I can do this with pandas data frame?
Sorry if this question is very basic but I am pretty new to pandas.
Sample data:
expected data:
For each row of the data I want to check the amount of price change for the last time the product was sold.
You can do this as follows:
# define a function that applies rolling window calculationg
# taking the difference between the last value and the current
# value
def calc_mrp(ser):
# in case you want the relative change, just
# divide by x[1] or x[0] in the lambda function
return ser.rolling(window=2).apply(lambda x: x[1]-x[0])
# apply this to the grouped 'product_mrp' column
# and store the result in a new column
df['mrp_change']=df.groupby('product_id')['product_mrp'].apply(calc_mrp)
If this is executed on a dataframe like:
Out[398]:
order_id product_id product_mrp units_sold sale_date
0 0 2 647.169280 8 2019-08-23
1 1 0 500.641188 0 2019-08-24
2 2 1 647.789399 15 2019-08-25
3 3 0 381.278167 12 2019-08-26
4 4 2 373.685000 7 2019-08-27
5 5 4 553.472850 2 2019-08-28
6 6 4 634.482718 7 2019-08-29
7 7 3 536.760482 11 2019-08-30
8 8 0 690.242274 6 2019-08-31
9 9 4 500.515521 0 2019-09-01
It yields:
Out[400]:
order_id product_id product_mrp units_sold sale_date mrp_change
0 0 2 647.169280 8 2019-08-23 NaN
1 1 0 500.641188 0 2019-08-24 NaN
2 2 1 647.789399 15 2019-08-25 NaN
3 3 0 381.278167 12 2019-08-26 -119.363022
4 4 2 373.685000 7 2019-08-27 -273.484280
5 5 4 553.472850 2 2019-08-28 NaN
6 6 4 634.482718 7 2019-08-29 81.009868
7 7 3 536.760482 11 2019-08-30 NaN
8 8 0 690.242274 6 2019-08-31 308.964107
9 9 4 500.515521 0 2019-09-01 -133.967197
The NaNs are in the rows, for which there is not previous order with the same product_id.

Find days since last event pandas dataframe

I have a pandas data frame:
df12 = pd.DataFrame({'group_ids':[1,1,1,2,2,2],'dates':['2016-04-01','2016-04-20','2016-04-28','2016-04-05','2016-04-20','2016-04-29'],'event_today_in_group':[1,0,1,1,1,0]})
group_ids dates event_today_in_group
0 1 2016-04-01 1
1 1 2016-04-20 0
2 1 2016-04-28 1
3 2 2016-04-05 1
4 2 2016-04-20 1
5 2 2016-04-29 0
I would like to compute an additional column that contains, for each group_ids, the number of days since the last time event_today_in_group was 1.
group_ids dates event_today_in_group days_since_last_event
0 1 2016-04-01 1 0
1 1 2016-04-20 0 19
2 1 2016-04-28 1 27
3 2 2016-04-05 1 0
4 2 2016-04-20 1 15
5 2 2016-04-29 0 9
As I mentioned earlier, this will get you the non-cumulative difference between dates within each group:
df['days_since_last_event'] = df.groupby('group_ids')['dates'].diff().apply(lambda x: x.days)
In order to get a cumulative sum of this difference, based on whenever event_today_in_group changes, I propose using shift to get the value of the previous row, and then generating a cumulative sum, like so:
df['event_today_in_group'].shift().cumsum()
Output:
0 NaN
1 1.0
2 1.0
3 2.0
4 3.0
5 4.0
This gives us the second grouping value we need to get the cumulative sums. You could assign the above values to a new column, but if you're only using them for the calculation, then you can simply include them in the subsequent groupby operation like so:
df.loc[:, 'days_since_last_event'] = df.groupby(['group_ids', df['event_today_in_group'].shift().cumsum()])['days_since_last_event'].cumsum()
Result:
group_ids dates event_today_in_group days_since_last_event
0 1 2016-04-01 1 NaN
1 1 2016-04-20 0 19.0
2 1 2016-04-28 1 27.0
3 2 2016-04-05 1 NaN
4 2 2016-04-20 1 15.0
5 2 2016-04-29 0 9.0

Categories