Generate date ranges using group by + apply in pandas - python

I want to imitate prophet make_future_dataframe() functionality for multiple groups in a pandas dataframe.
If I would like to create a date range as a separate column I could do:
import pandas as pd
my_dataframe['prediction_range'] = pd.date_range(start=my_dataframe['date_column'].min(),
                                                 periods=48,
                                                 freq='M')
However my dataframe has the following structure:
id feature1 feature2 date_column
1 0 4.3 2022-01-01
2 0 3.3 2022-01-01
3 0 2.2 2022-01-01
4 1034 1.11 2022-01-01
5 1090 0.98 2022-01-01
6 1078 0 2022-01-01
I wanted to do the following:
def generate_date_range(date_column, data):
    dates = pd.date_range(start=data[date_column].unique()[0],
                          periods=48,
                          freq='M')
    return dates
And then:
my_dataframe = my_dataframe.groupby('id').apply(generate_date_ranges('date_columns', my_dataframe))
But I am getting the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/anaconda/envs/scoring_env/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1377, in apply
func = com.is_builtin_func(func)
File "/anaconda/envs/scoring_env/lib/python3.9/site-packages/pandas/core/common.py", line 615, in is_builtin_func
return _builtin_table.get(arg, arg)
TypeError: unhashable type: 'DatetimeIndex'
I am not sure if I am approaching the problem in the right way. I have also done this with a MultiIndex:
multi_index = pd.MultiIndex.from_product([pd.Index(file['id'].unique()), dates], names=('customer', 'prediction_date'))
And then reindexing and filling the NaNs, but I am not able to understand why the apply version does not work.
The desired output is:
id feature1 feature2 date_column prediction_date
1 0 4.3 2022-01-01 2022-03-01
1 0 4.3 2022-01-01 2022-04-01
1 0 4.3 2022-01-01 2022-05-01
1 0 4.3 2022-01-01 2022-06-01
--- Up to 48 periods --
2 0 3.3 2022-01-01 2022-03-01
2 0 3.3 2022-01-01 2022-04-01
2 0 3.3 2022-01-01 2022-05-01
2 0 3.3 2022-01-01 2022-06-01
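As a side note on the traceback itself: the TypeError appears because `generate_date_ranges(...)` is evaluated before `.apply` ever runs, so `.apply` receives a DatetimeIndex rather than a function. A minimal sketch of the corrected call, using a small stand-in frame and month-start frequency to match the desired output:

```python
import pandas as pd

my_dataframe = pd.DataFrame({'id': [1, 2],
                             'date_column': pd.to_datetime(['2022-01-01', '2022-01-01'])})

def generate_date_range(date_column, data):
    # one 48-period monthly range per group, starting at the group's first date
    return pd.date_range(start=data[date_column].unique()[0], periods=48, freq='MS')

# pass a callable; the original code called the function first and handed
# .apply its return value (a DatetimeIndex), which is not hashable
ranges = my_dataframe.groupby('id').apply(lambda g: generate_date_range('date_column', g))
```

This yields one DatetimeIndex per id, which still has to be exploded back onto the rows; the answers below go straight to the long format instead.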

Try a list comprehension over your groupby object in which you reindex the dates, then forward-fill the id:
df['date_column'] = pd.to_datetime(df['date_column'])
df = df.set_index('date_column')
new_df = pd.concat([g.reindex(pd.date_range(g.index.min(), periods=48, freq='MS'))
                    for _, g in df.groupby('id')])
new_df['id'] = new_df['id'].ffill().astype(int)
id feature1 feature2
2022-01-01 1 0.0 4.3
2022-02-01 1 NaN NaN
2022-03-01 1 NaN NaN
2022-04-01 1 NaN NaN
2022-05-01 1 NaN NaN
... .. ... ...
2025-08-01 6 NaN NaN
2025-09-01 6 NaN NaN
2025-10-01 6 NaN NaN
2025-11-01 6 NaN NaN
2025-12-01 6 NaN NaN
Update
If there is only one record for each ID we can do the following. If there is more than one record per ID, then we need to keep only the minimum date for each ID, perform the task below, and merge everything back together.
# make sure your date is datetime
df['date_column'] = pd.to_datetime(df['date_column'])
# use index.repeat for the number of months you want
# in this case we will offset the min date for 48 months
new_df = df.reindex(df.index.repeat(48)).reset_index(drop=True)
# groupby the id, cumcount and set the type so we can offset
new_df['date_column'] = new_df['date_column'].values.astype('datetime64[M]') + \
    new_df.groupby('id')['date_column'].cumcount().values.astype('timedelta64[M]')
id feature1 feature2 date_column
0 1 0 4.3 2022-01-01
1 1 0 4.3 2022-02-01
2 1 0 4.3 2022-03-01
3 1 0 4.3 2022-04-01
4 1 0 4.3 2022-05-01
.. .. ... ... ...
283 6 1078 0.0 2025-08-01
284 6 1078 0.0 2025-09-01
285 6 1078 0.0 2025-10-01
286 6 1078 0.0 2025-11-01
287 6 1078 0.0 2025-12-01
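For the multi-record case mentioned in the update, a rough sketch (column names assumed from the question): keep the minimum date per ID, build the 48-month range from it with the same repeat/cumcount trick, and merge back onto the original rows.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2],
                   'feature1': [0, 0, 1034],
                   'date_column': pd.to_datetime(['2022-01-01', '2022-02-01', '2022-01-01'])})

# keep only the earliest date per id
starts = df.groupby('id', as_index=False)['date_column'].min()
# repeat each start row 48 times, then offset by the cumulative month count
starts = starts.reindex(starts.index.repeat(48)).reset_index(drop=True)
starts['prediction_date'] = starts['date_column'].values.astype('datetime64[M]') + \
    starts.groupby('id').cumcount().values.astype('timedelta64[M]')
# merge the generated range back onto the original rows
result = df.merge(starts[['id', 'prediction_date']], on='id')
```

Every original row for an ID is repeated for each of that ID's 48 prediction dates, matching the desired output shape.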

Related

How to calculate month by month change in value per user in pandas?

I was looking for similar topics, but I only found the overall change by month. I would like the month-over-month change in a value (e.g. UPL), but per user, as in the example below.
user_id month UPL
1 2022-01-01 00:00:00 100
1 2022-02-01 00:00:00 200
2 2022-01-01 00:00:00 100
2 2022-02-01 00:00:00 50
1 2022-03-01 00:00:00 150
And to have additional column named "UPL change month by month":
user_id month UPL UPL_change_by_month
1 2022-01-01 00:00:00 100 0
1 2022-02-01 00:00:00 200 100
2 2022-01-01 00:00:00 100 0
2 2022-02-01 00:00:00 50 -50
1 2022-03-01 00:00:00 150 -50
Is it possible using aggfunc or shift function using Pandas?
IIUC, you can use groupby + diff:
df['UPL_change_by_month'] = df.sort_values('month').groupby('user_id')['UPL'].diff().fillna(0)
print(df)
# Output
user_id month UPL UPL_change_by_month
0 1 2022-01-01 100 0.0
1 1 2022-02-01 200 100.0
2 2 2022-01-01 100 0.0
3 2 2022-02-01 50 -50.0
4 1 2022-03-01 150 -50.0

Fill monthly holes (time-series) in a pandas dataframe with several categories [duplicate]

This question already has answers here:
Pandas filling missing dates and values within group
(3 answers)
Closed 4 months ago.
I have a time-series in pandas with several products (ids: a, b, etc.), but with monthly holes. I have to fill those holes, whether with np.nan or any other constant. I tried groupby but wasn't able to make it work.
date id units
2022-01-01 a 10
2022-01-01 b 100
2022-02-01 a 15
2022-03-01 a 30
2022-03-01 b 70
2022-05-01 b 60
2022-06-01 a 8
2022-06-01 b 90
Should be:
date id units
2022-01-01 a 10
2022-01-01 b 100
2022-02-01 a 15
2022-02-01 b np.nan
2022-03-01 a 30
2022-03-01 b 70
2022-04-01 a np.nan
2022-04-01 b np.nan
2022-05-01 a np.nan
2022-05-01 b 60
2022-06-01 a 8
2022-06-01 b 90
You can do pivot then stack
df = df.pivot(index='date', columns='id', values='units').stack(dropna=False).reset_index(name='units')
Out[126]:
date id units
0 2022-01-01 a 10.0
1 2022-01-01 b 100.0
2 2022-02-01 a 15.0
3 2022-02-01 b NaN
4 2022-03-01 a 30.0
5 2022-03-01 b 70.0
6 2022-05-01 a NaN
7 2022-05-01 b 60.0
8 2022-06-01 a 8.0
9 2022-06-01 b 90.0
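Note that the output above has no 2022-04 rows at all, because neither id has data in that month; pivot/stack can only fill in combinations of values that appear somewhere in the frame. A sketch that also restores months missing from every id, by reindexing the pivoted frame against a complete monthly range (melt is used instead of stack purely to keep the example version-stable):

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2022-01-01', '2022-01-01', '2022-02-01',
                                           '2022-03-01', '2022-03-01', '2022-05-01',
                                           '2022-06-01', '2022-06-01']),
                   'id': ['a', 'b', 'a', 'a', 'b', 'b', 'a', 'b'],
                   'units': [10, 100, 15, 30, 70, 60, 8, 90]})

wide = df.pivot(index='date', columns='id', values='units')
# reindex against every month start in the observed span, then go back to long form
full = pd.date_range(wide.index.min(), wide.index.max(), freq='MS')
out = (wide.reindex(full)
           .rename_axis('date')
           .reset_index()
           .melt(id_vars='date', var_name='id', value_name='units')
           .sort_values(['date', 'id'])
           .reset_index(drop=True))
```

With this, 2022-04 appears for both ids with NaN units, matching the "Should be" table.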
import numpy as np

df2 = (df.set_index('date')
         .groupby('id', group_keys=False)
         .apply(lambda x: x.resample('1MS').asfreq(fill_value=np.nan))
         .reset_index())
df2['id'].ffill(inplace=True)
df2
date id units
0 2022-01-01 a 10.0
1 2022-02-01 a 15.0
2 2022-03-01 a 30.0
3 2022-04-01 a NaN
4 2022-05-01 a NaN
5 2022-06-01 a 8.0
6 2022-01-01 b 100.0
7 2022-02-01 b NaN
8 2022-03-01 b 70.0
9 2022-04-01 b NaN
10 2022-05-01 b 60.0
11 2022-06-01 b 90.0

How to create a new column with the last value of the previous year

I have this data frame
import pandas as pd
df = pd.DataFrame({'COTA':['A','A','A','A','A','B','B','B','B'],
'Date':['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020','20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
'Mark':[1,2,3,4,5,1,2,3,3]
})
print(df)
Based on this data frame I wanted the Mark from the previous year. I managed to get the maximum per group, but I wanted the last one; I used .max() and thought I could get it with .last(), but it didn't work.
Here is an example of my code.
df['Date'] = pd.to_datetime(df['Date'])
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
s1 = df.groupby(['Found', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['Found', 'LastYear'])
print (df)
Found Date Mark LastYear Max_MarkLastYear
0 A 2021-10-14 1 2021-12-31 5.0
1 A 2020-10-19 2 2020-12-31 3.0
2 A 2019-10-29 3 2019-12-31 NaN
3 A 2021-09-30 4 2021-12-31 5.0
4 A 2020-09-20 5 2020-12-31 3.0
5 B 2021-10-20 1 2021-12-31 3.0
6 B 2020-10-29 2 2020-12-31 3.0
7 B 2019-10-15 3 2019-12-31 NaN
8 B 2020-10-09 3 2020-12-31 3.0
How do I create a new column with the last value of the previous year?
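This last part went unanswered above; a minimal sketch of one way to get the last (rather than maximum) Mark of the previous year, assuming "last" means last in date order (the column names Year and Last_MarkLastYear are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'COTA': ['A', 'A', 'A', 'B', 'B'],
                   'Date': pd.to_datetime(['2021-10-14', '2020-10-19', '2020-09-20',
                                           '2021-10-20', '2020-10-29']),
                   'Mark': [1, 2, 5, 1, 2]})

df['Year'] = df['Date'].dt.year
# last Mark within each (COTA, year), taken in date order
last = (df.sort_values('Date')
          .groupby(['COTA', 'Year'])['Mark']
          .last()
          .rename('Last_MarkLastYear'))
# shift the year key forward so each row looks up the previous year's value
last = last.rename(index=lambda y: y + 1, level=1)
df = df.join(last, on=['COTA', 'Year'])
```

This mirrors the structure of the .max() solution in the question, swapping the aggregation for sort_values + last.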

How to get 1 for 8 days after a date in pandas and 0 otherwise?

I have two dataframes:
daily = pd.DataFrame({'Date': pd.date_range(start="2021-01-01",end="2021-04-29")})
pc21 = pd.DataFrame({'Date': ["2021-01-21", "2021-03-11", "2021-04-22"]})
pc21['Date'] = pd.to_datetime(pc21['Date'])
What I want to do is the following: for every date in pc21 that also appears in daily, I want a new column with values equal to 1 on that date and the following 7 days (8 days in total), and 0 otherwise.
This is an example of a desired output:
# 2021-01-21 is in either daframes so I want a new column in 'daily' that looks like this:
Date newcol
.
.
.
2021-01-20 0
2021-01-21 1
2021-01-22 1
2021-01-23 1
2021-01-24 1
2021-01-25 1
2021-01-26 1
2021-01-27 1
2021-01-28 1
2021-01-29 0
.
.
.
Can anyone help me achieve this?
Thanks!
you can try the following approach:
res = (daily
       .merge(pd.concat([pd.date_range(d, freq="D", periods=8).to_frame(name="Date")
                         for d in pc21["Date"]]),
              how="left", indicator=True)
       .replace({"both": 1, "left_only": 0})
       .rename(columns={"_merge": "newcol"}))
result
In [15]: res
Out[15]:
Date newcol
0 2021-01-01 0
1 2021-01-02 0
2 2021-01-03 0
3 2021-01-04 0
4 2021-01-05 0
.. ... ...
114 2021-04-25 1
115 2021-04-26 1
116 2021-04-27 1
117 2021-04-28 1
118 2021-04-29 1
[119 rows x 2 columns]
daily['value'] = 0
pc21['value'] = 1
daily = pd.merge(daily, pc21, on='Date', how='left').rename(
    columns={'value_y': 'value'}).drop(columns='value_x').ffill(limit=7).fillna(0)
pc21 = pc21.drop(columns='value')
Output Subset
daily.query('value.eq(1)')
Date value
20 2021-01-21 1.0
21 2021-01-22 1.0
22 2021-01-23 1.0
23 2021-01-24 1.0
24 2021-01-25 1.0
25 2021-01-26 1.0
26 2021-01-27 1.0
27 2021-01-28 1.0
69 2021-03-11 1.0
daily["new_col"] = np.where(daily.Date.isin(pc21.Date), 1, np.nan)
daily["new_col"] = daily["new_col"].ffill(limit=7).fillna(0)
We generate the new column first:
if the Date of daily is in the Date of pc21, put 1; else put a NaN.
Then forward-fill that column with a limit of 7, so that we have 8 consecutive 1s.
Lastly, fill the remaining NaNs with 0.
(You can put an astype(int) at the end to have integers.)
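A self-contained version of that approach, with the imports included and the astype(int) step applied:

```python
import numpy as np
import pandas as pd

daily = pd.DataFrame({'Date': pd.date_range('2021-01-01', '2021-04-29')})
pc21 = pd.DataFrame({'Date': pd.to_datetime(['2021-01-21', '2021-03-11', '2021-04-22'])})

# 1 on each start date, NaN elsewhere, then extend each 1 over the next 7 days
daily['new_col'] = np.where(daily['Date'].isin(pc21['Date']), 1, np.nan)
daily['new_col'] = daily['new_col'].ffill(limit=7).fillna(0).astype(int)
```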

Merge pandas df based on 2 keys

I have 2 df and I would like to merge them based on 2 keys - ID and date:
The following is just a small slice of the entire df.
df_pw6
ID date pw10_0 pw50_0 pw90_0
0 153 2018-01-08 27.88590 43.2872 58.2024
0 2 2018-01-05 11.03610 21.4879 31.6997
0 506 2018-01-08 6.98468 25.3899 45.9486
df_ex
date ID measure f188 f187 f186 f185
0 2017-07-03 501 NaN 1 0.5 7 4.0
1 2017-07-03 502 NaN 0 2.5 5 3.0
2 2018-01-08 506 NaN 5 9.0 9 1.2
As you can see, only the third row has a match.
When I type:
#check date
df_ex.iloc[2,0]== df_pw6.iloc[1,1]
True
#check ID
df_ex.iloc[2,1] == df_pw6.iloc[2,0]
True
Now I try to merge them:
df19 = pd.merge(df_pw6,df_ex,on=['date','ID'])
I get an empty df
When I try:
df19 = pd.merge(df_pw6,df_ex,how ='left',on=['date','ID'])
I get:
ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
0 153 2018-01-08 00:00:00 27.88590 43.2872 58.2024 NaN NaN NaN NaN NaN
1 2 2018-01-05 00:00:00 11.03610 21.4879 31.6997 NaN NaN NaN NaN NaN
2 506 2018-01-08 00:00:00 6.98468 25.3899 45.9486 NaN NaN NaN NaN NaN
My desired result should be:
> ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
>
> 0 506 2018-01-08 00:00:00 6.98468 25.3899 45.9486 NaN 5 9.0 9 1.2
I ran your code after your edit, and I got the desired result.
import pandas as pd
# copy paste your first df by hand
pw = pd.read_clipboard()
# copy paste your second df by hand
ex = pd.read_clipboard()
pd.merge(pw,ex,on=['date','ID'])
# output [edited. now it is the correct result OP wanted.]
ID date pw10_0 pw50_0 pw90_0 measure f188 f187 f186 f185
0 506 2018-01-08 6.98468 25.3899 45.9486 NaN 5 9.0 9 1.2
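The empty merge in the question is the classic symptom of a key dtype mismatch (e.g. date held as a string in one frame and as datetime in the other, or ID as string vs int); pd.read_clipboard re-parses both frames with consistent dtypes, which is why the merge then works. A sketch of checking and normalizing the keys directly (the specific mismatched dtypes here are an assumption about the original frames):

```python
import pandas as pd

# stand-in frames reproducing a plausible mismatch
df_pw6 = pd.DataFrame({'ID': [153, 2, 506],
                       'date': ['2018-01-08', '2018-01-05', '2018-01-08'],  # strings
                       'pw10_0': [27.8859, 11.0361, 6.98468]})
df_ex = pd.DataFrame({'ID': ['501', '502', '506'],  # strings
                      'date': pd.to_datetime(['2017-07-03', '2017-07-03', '2018-01-08']),
                      'f188': [1, 0, 5]})

# inspect the key dtypes first: print(df_pw6.dtypes, df_ex.dtypes)

# normalize both sides, then merge
df_pw6['date'] = pd.to_datetime(df_pw6['date'])
df_ex['ID'] = df_ex['ID'].astype(int)
df19 = pd.merge(df_pw6, df_ex, on=['date', 'ID'])
```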
