Apply function to a range of specific rows - python

I have the following dataframe df:
bucket_value is_new_bucket
dates
2019-03-07 0 1
2019-03-08 1 0
2019-03-09 2 0
2019-03-10 3 0
2019-03-11 4 0
2019-03-12 5 1
2019-03-13 6 0
2019-03-14 7 1
I want to apply a specific function (let's say the mean function) to each group of bucket_value rows where the column is_new_bucket is equal to zero, so that the resulting dataframe looks like this:
mean_values
dates
2019-03-08 2.5
2019-03-13 6.0
In other words, I want to apply a function that takes bucket_value as input to each run of consecutive rows where is_new_bucket = 0.
For instance, if I want to apply the max function, the resulting dataframe would look like this:
max_values
dates
2019-03-11 4.0
2019-03-13 6.0

Use cumsum together with a filter:
df.reset_index(inplace=True)
s=df.loc[df.is_new_bucket==0].groupby(df.is_new_bucket.cumsum()).agg({'date':'first','bucket_value':['mean','max']})
s
date bucket_value
first mean max
is_new_bucket
1 2019-03-08 2.5 4
2 2019-03-13 6.0 6
Update: to get the row holding the maximum of each group, use idxmax:
df.loc[df.loc[df.is_new_bucket==0].groupby(df.is_new_bucket.cumsum())['bucket_value'].idxmax()]
date bucket_value is_new_bucket
4 2019-03-11 4 0
6 2019-03-13 6 0
Update 2: once you have used cumsum to create the group key Newkey, you can do whatever you need based on that key:
df['Newkey']=df.is_new_bucket.cumsum()
df
date bucket_value is_new_bucket Newkey
0 2019-03-07 0 1 1
1 2019-03-08 1 0 1
2 2019-03-09 2 0 1
3 2019-03-10 3 0 1
4 2019-03-11 4 0 1
5 2019-03-12 5 1 2
6 2019-03-13 6 0 2
7 2019-03-14 7 1 3
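For reference, here is a minimal self-contained sketch of the cumsum idea above. It rebuilds the sample frame from the question and keeps the dates as the index rather than calling reset_index; the variable names (group_key, inside, summary) are my own and not part of the original answer.

```python
import pandas as pd

# Sample frame from the question, with the dates kept as the index.
df = pd.DataFrame(
    {
        "bucket_value": [0, 1, 2, 3, 4, 5, 6, 7],
        "is_new_bucket": [1, 0, 0, 0, 0, 1, 0, 1],
    },
    index=pd.date_range("2019-03-07", periods=8, name="dates"),
)

# Every 1 in is_new_bucket opens a new bucket, so the running total labels the buckets.
group_key = df["is_new_bucket"].cumsum()

# Keep only the rows inside a bucket before aggregating.
inside = df[df["is_new_bucket"] == 0]
grouped = inside.groupby(group_key.loc[inside.index])["bucket_value"]

summary = grouped.agg(["mean", "max"])  # one row per bucket
max_dates = grouped.idxmax()            # date on which each bucket reaches its max

print(summary)
print(max_dates)
```

Any other aggregation (or a custom function passed to agg) can be swapped in for mean and max, since the grouping itself does not depend on the function applied.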

Related

Subtract previous row from preceding row by group WITH condition

I have a data frame
Count ID Date
1 1 2020-07-09
2 1 2020-07-11
1 1 2020-07-21
1 2 2020-07-04
2 2 2020-07-09
3 2 2020-07-18
1 3 2020-07-02
2 3 2020-07-05
1 3 2020-07-19
2 3 2020-07-22
Within each ID group, I want to subtract from each row's date the date of the previous row that has the same Count. Rows with no earlier row of the same Count get a value of zero.
Expected output
ID Date Days
1 2020-07-09 0
1 2020-07-11 0
1 2020-07-21 12 (2020-07-21 MINUS 2020-07-09)
2 2020-07-04 0
2 2020-07-09 0
2 2020-07-18 0
3 2020-07-02 0
3 2020-07-05 0
3 2020-07-19 17 (2020-07-19 MINUS 2020-07-02)
3 2020-07-22 17 (2020-07-22 MINUS 2020-07-05)
My initial thought is to filter out the Count-ID pairs and then do the calculation. I was wondering if there is a better way to do this.
You can use groupby() to group by columns ID and Count, get the difference in days by .diff(). Fill NaN values with 0 by .fillna(), as follows:
df['Date'] = pd.to_datetime(df['Date']) # convert to datetime if not already in datetime format
df['Days'] = df.groupby(['ID', 'Count'])['Date'].diff().dt.days.fillna(0).astype(int) # fillna's downcast argument is deprecated in recent pandas
Result:
print(df)
Count ID Date Days
0 1 1 2020-07-09 0
1 2 1 2020-07-11 0
2 1 1 2020-07-21 12
3 1 2 2020-07-04 0
4 2 2 2020-07-09 0
5 3 2 2020-07-18 0
6 1 3 2020-07-02 0
7 2 3 2020-07-05 0
8 1 3 2020-07-19 17
9 2 3 2020-07-22 17
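If you want to try this end to end, the following is a small runnable sketch that rebuilds the frame from the question. The intermediate name gaps is mine, and the cast to int at the end is one way to get whole-number days; adjust it if your data can contain missing dates.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    {
        "Count": [1, 2, 1, 1, 2, 3, 1, 2, 1, 2],
        "ID": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
        "Date": [
            "2020-07-09", "2020-07-11", "2020-07-21", "2020-07-04", "2020-07-09",
            "2020-07-18", "2020-07-02", "2020-07-05", "2020-07-19", "2020-07-22",
        ],
    }
)
df["Date"] = pd.to_datetime(df["Date"])

# diff() runs inside each (ID, Count) pair; the first row of a pair has no
# predecessor and therefore yields NaT.
gaps = df.groupby(["ID", "Count"])["Date"].diff()

# Convert to whole days and replace the missing values with 0.
df["Days"] = gaps.dt.days.fillna(0).astype(int)

print(df)
```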
I like SeaBean's answer, but here is what I was working on before I saw it:
df2 = df.sort_values(by = ['ID', 'Count'])
df2['Date'] = pd.to_datetime(df2['Date'])
df2['shift1'] = df2.groupby(['ID', 'Count'])['Date'].shift(1)
df2['diff'] = (df2.Date- df2.shift1.combine_first(df2.Date) ).dt.days

Group nearby dates

I want to group nearby dates together, using a rolling window (?) of three week periods.
See example and attempt below:
import pandas as pd
d = {'id':[1, 1, 1, 1, 2, 3],
'datefield':['2021-01-01', '2021-01-15', '2021-01-30', '2021-02-05', '2020-02-10', '2020-02-20']}
df = pd.DataFrame(data=d)
df['datefield'] = pd.to_datetime(df['datefield'])
# id datefield
#0 1 2021-01-01
#1 1 2021-01-15
#2 1 2021-01-30
#3 1 2021-02-05
#4 2 2020-02-10
#5 3 2020-02-20
df['event'] = df.groupby(['id', pd.Grouper(key='datefield', freq='3W')]).ngroup()
# id datefield event
#0 1 2021-01-01 0
#1 1 2021-01-15 0
#2 1 2021-01-30 1 #Should be 0, since last id 1 event happened just 2 weeks ago
#3 1 2021-02-05 1 #Should be 0
#4 2 2020-02-10 2
#5 3 2020-02-20 3 #Correct, within 3 weeks of another but since the ids are not the same the event is different
You can compute a few helper columns to make the logic easy to follow. Starting from:
df
id datefield
0 1 2021-01-01
1 1 2021-01-15
2 1 2021-01-30
3 1 2021-02-05
4 2 2020-02-10
5 2 2020-03-20
Calculate difference between dates in number of days
df['diff'] = df['datefield'].diff().dt.days
Get previous ID
df['prevId'] = df['id'].shift()
Decide whether to increment or not
import numpy as np
df['increment'] = np.where((df['diff']>21) | (df['prevId'] != df['id']), 1, 0)
Lastly, just get the cumulative sum
df['event'] = df['increment'].cumsum()
Output
id datefield diff prevId increment event
0 1 2021-01-01 NaN NaN 1 1
1 1 2021-01-15 14.0 1.0 0 1
2 1 2021-01-30 15.0 1.0 0 1
3 1 2021-02-05 6.0 1.0 0 1
4 2 2020-02-10 -361.0 1.0 1 2
5 2 2020-03-20 39.0 2.0 1 3
Let's try a different approach using a boolean series instead:
df['group'] = ((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))) |
(df['id'].ne(df['id'].shift()))).cumsum()
Output:
id datefield group
0 1 2021-01-01 1
1 1 2021-01-15 1
2 1 2021-01-30 1
3 1 2021-02-05 1
4 2 2020-02-10 2
5 2 2020-03-20 3
Is the difference from the previous row greater than 3 weeks?
print((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))))
0 False
1 False
2 False
3 False
4 False
5 True
Name: datefield, dtype: bool
Or is the current id different from the previous id?
print((df['id'].ne(df['id'].shift())))
0 True
1 False
2 False
3 False
4 True
5 False
Name: id, dtype: bool
Combine the two conditions with or (|):
print((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))) |
(df['id'].ne(df['id'].shift())))
0 True
1 False
2 False
3 False
4 True
5 True
dtype: bool
Then use cumsum to increment the counter wherever there is a True value, which delimits the groups.
*Assumes the id and datefield columns are appropriately ordered.
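Putting the steps above together, here is a self-contained sketch using the six-row frame from this answer. The names gap_too_long and id_changed are only illustrative. The fillna call is left out here because comparing NaT with a Timedelta already evaluates to False.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 1, 1, 2, 2],
        "datefield": pd.to_datetime(
            ["2021-01-01", "2021-01-15", "2021-01-30",
             "2021-02-05", "2020-02-10", "2020-03-20"]
        ),
    }
)

# True where more than three weeks passed since the previous row.
gap_too_long = df["datefield"].diff().gt(pd.Timedelta(weeks=3))

# True where the id differs from the previous row (always True for the first row).
id_changed = df["id"].ne(df["id"].shift())

# Each True starts a new group; the running total numbers the groups.
df["group"] = (gap_too_long | id_changed).cumsum()

print(df)
```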
It looks like you want the diff between consecutive rows to be three weeks or less, otherwise a new group is formed. You can do it like this, starting from initial time t0:
df = df.sort_values("datefield").reset_index(drop=True)
t0 = df.datefield.iloc[0]
df["delta_t"] = pd.TimedeltaIndex(df.datefield - t0)
df["group"] = (df.delta_t.dt.days.diff() > 21).cumsum()
output:
id datefield delta_t group
0 2 2020-02-10 0 days 0
1 2 2020-03-20 39 days 1
2 1 2021-01-01 326 days 2
3 1 2021-01-15 340 days 2
4 1 2021-01-30 355 days 2
5 1 2021-02-05 361 days 2
Note that your original dataframe is not sorted properly.

Pandas calculate average value of column for rows satisfying condition

I have a dataframe containing information about users rating items over a period of time.
The dataframe has a number of rows with identical 'user_id' and 'business_id', which I retrieve using the following code:
mask = reviews_df.duplicated(subset=['user_id','business_id'], keep=False)
dup = reviews_df[mask]
This gives me the duplicated rows.
I now need to remove all such duplicates from the original dataframe and replace them with their average. Is there a fast and elegant way to achieve this? Thanks!
So if you have a dataframe that looks like this:
review_id user_id business_id stars date
0 1 0 3 2.0 2019-01-01
1 2 1 3 5.0 2019-11-11
2 3 0 2 4.0 2019-10-22
3 4 3 4 3.0 2019-09-13
4 5 3 4 1.0 2019-02-14
5 6 0 2 5.0 2019-03-17
Then the solution should be something like this:
df.loc[df.duplicated(['user_id', 'business_id'], keep=False)]\
.groupby(['user_id', 'business_id'])\
.apply(lambda x: x.stars - x.stars.mean())
With the following result:
user_id business_id
0 2 2 -0.5
5 0.5
3 4 3 1.0
4 -1.0
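Note that the snippet above returns each duplicate's deviation from its group mean. If the goal is to collapse every (user_id, business_id) pair into a single row carrying the average rating, as the question asks, a plain groupby mean may be enough. This is only a sketch on the sample frame above; the name averaged is mine, and columns such as review_id and date are dropped because they have no obvious per-pair value.

```python
import pandas as pd

reviews_df = pd.DataFrame(
    {
        "review_id": [1, 2, 3, 4, 5, 6],
        "user_id": [0, 1, 0, 3, 3, 0],
        "business_id": [3, 3, 2, 4, 4, 2],
        "stars": [2.0, 5.0, 4.0, 3.0, 1.0, 5.0],
    }
)

# One row per (user_id, business_id) pair. Pairs that occur once keep their
# rating; duplicated pairs are replaced by the mean of their ratings.
averaged = (
    reviews_df
    .groupby(["user_id", "business_id"], as_index=False)["stars"]
    .mean()
)

print(averaged)
```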

creating daily price change for a product on a pandas dataframe

I am working on a data set with the following columns:
order_id
order_item_id
product mrp
units
sale_date
I want to create a new column which shows how much the mrp changed since the last time this product was sold. Is there a way I can do this with a pandas data frame?
Sorry if this question is very basic but I am pretty new to pandas.
Sample data:
expected data:
For each row of the data I want to check the amount of price change for the last time the product was sold.
You can do this as follows:
# define a function that applies a rolling window calculation,
# taking the difference between the current value and the
# previous value
def calc_mrp(ser):
    # in case you want the relative change, just
    # divide by x[1] or x[0] in the lambda function
    # raw=True passes each window as a NumPy array, so x[0] and x[1] are positional
    return ser.rolling(window=2).apply(lambda x: x[1] - x[0], raw=True)
# apply this to the grouped 'product_mrp' column
# and store the result in a new column
# transform keeps the original row index, so the assignment aligns
df['mrp_change'] = df.groupby('product_id')['product_mrp'].transform(calc_mrp)
If this is executed on a dataframe like:
Out[398]:
order_id product_id product_mrp units_sold sale_date
0 0 2 647.169280 8 2019-08-23
1 1 0 500.641188 0 2019-08-24
2 2 1 647.789399 15 2019-08-25
3 3 0 381.278167 12 2019-08-26
4 4 2 373.685000 7 2019-08-27
5 5 4 553.472850 2 2019-08-28
6 6 4 634.482718 7 2019-08-29
7 7 3 536.760482 11 2019-08-30
8 8 0 690.242274 6 2019-08-31
9 9 4 500.515521 0 2019-09-01
It yields:
Out[400]:
order_id product_id product_mrp units_sold sale_date mrp_change
0 0 2 647.169280 8 2019-08-23 NaN
1 1 0 500.641188 0 2019-08-24 NaN
2 2 1 647.789399 15 2019-08-25 NaN
3 3 0 381.278167 12 2019-08-26 -119.363022
4 4 2 373.685000 7 2019-08-27 -273.484280
5 5 4 553.472850 2 2019-08-28 NaN
6 6 4 634.482718 7 2019-08-29 81.009868
7 7 3 536.760482 11 2019-08-30 NaN
8 8 0 690.242274 6 2019-08-31 308.964107
9 9 4 500.515521 0 2019-09-01 -133.967197
The NaNs appear in the rows for which there is no previous order with the same product_id.
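Since a rolling window of size 2 with "current minus previous" is just a first difference, the same column can probably be obtained with groupby plus diff. The sketch below uses a shortened frame with rounded, made-up prices (not the exact values shown above) and assumes the rows should be ordered by sale_date first.

```python
import pandas as pd

# Illustrative data only: rounded prices, five rows.
df = pd.DataFrame(
    {
        "product_id": [2, 0, 1, 0, 2],
        "product_mrp": [647.0, 500.0, 648.0, 381.0, 374.0],
        "sale_date": pd.to_datetime(
            ["2019-08-23", "2019-08-24", "2019-08-25", "2019-08-26", "2019-08-27"]
        ),
    }
)

# Make sure "previous sale" means the previous sale in time.
df = df.sort_values("sale_date")

# Difference to the previous price of the same product; NaN for the first sale.
df["mrp_change"] = df.groupby("product_id")["product_mrp"].diff()

print(df)
```

For the relative change instead of the absolute one, pct_change() on the same grouped column is the analogous call.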

Find days since last event pandas dataframe

I have a pandas data frame:
df12 = pd.DataFrame({'group_ids':[1,1,1,2,2,2],'dates':['2016-04-01','2016-04-20','2016-04-28','2016-04-05','2016-04-20','2016-04-29'],'event_today_in_group':[1,0,1,1,1,0]})
group_ids dates event_today_in_group
0 1 2016-04-01 1
1 1 2016-04-20 0
2 1 2016-04-28 1
3 2 2016-04-05 1
4 2 2016-04-20 1
5 2 2016-04-29 0
I would like to compute an additional column that contains, for each group_ids, the number of days since the last time event_today_in_group was 1.
group_ids dates event_today_in_group days_since_last_event
0 1 2016-04-01 1 0
1 1 2016-04-20 0 19
2 1 2016-04-28 1 27
3 2 2016-04-05 1 0
4 2 2016-04-20 1 15
5 2 2016-04-29 0 9
As I mentioned earlier, this will get you the non-cumulative difference between dates within each group:
df['dates'] = pd.to_datetime(df['dates']) # the sample frame stores the dates as strings
df['days_since_last_event'] = df.groupby('group_ids')['dates'].diff().dt.days
In order to get a cumulative sum of this difference, based on whenever event_today_in_group changes, I propose using shift to get the value of the previous row, and then generating a cumulative sum, like so:
df['event_today_in_group'].shift().cumsum()
Output:
0 NaN
1 1.0
2 1.0
3 2.0
4 3.0
5 4.0
This gives us the second grouping value we need to get the cumulative sums. You could assign the above values to a new column, but if you're only using them for the calculation, then you can simply include them in the subsequent groupby operation like so:
df.loc[:, 'days_since_last_event'] = df.groupby(['group_ids', df['event_today_in_group'].shift().cumsum()])['days_since_last_event'].cumsum()
Result:
group_ids dates event_today_in_group days_since_last_event
0 1 2016-04-01 1 NaN
1 1 2016-04-20 0 19.0
2 1 2016-04-28 1 27.0
3 2 2016-04-05 1 NaN
4 2 2016-04-20 1 15.0
5 2 2016-04-29 0 9.0
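As an alternative to the double cumsum, here is a different technique that seems to give the same numbers: keep the date only on event rows, shift it within each group so a row never sees its own event, and forward-fill. This is a sketch on the sample data from the question; the names event_dates and previous_event are mine, and filling the first rows with 0 follows the expected output in the question rather than the NaN shown above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group_ids": [1, 1, 1, 2, 2, 2],
        "dates": pd.to_datetime(
            ["2016-04-01", "2016-04-20", "2016-04-28",
             "2016-04-05", "2016-04-20", "2016-04-29"]
        ),
        "event_today_in_group": [1, 0, 1, 1, 1, 0],
    }
)

# Date on rows where an event happened, NaT everywhere else.
event_dates = df["dates"].where(df["event_today_in_group"].eq(1))

# Shift within each group so today's event is not counted for today, then
# carry the most recent earlier event date forward within the group.
previous_event = event_dates.groupby(df["group_ids"]).shift()
previous_event = previous_event.groupby(df["group_ids"]).ffill()

# Rows with no earlier event in their group get 0.
df["days_since_last_event"] = (
    (df["dates"] - previous_event).dt.days.fillna(0).astype(int)
)

print(df)
```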
