I have a pandas dataframe with a date column and an id column. I would like to return, for each line, the number of occurrences of that line's id in the 14 days up to and including the line's date. For the data below, that means I would like to return "1, 2, 1, 2, 3, 4, 1". How can I do this? Performance is important since the dataframe has around 200,000 rows. Thanks!
date        id
2021-01-01  1
2021-01-04  1
2021-01-05  2
2021-01-06  2
2021-01-07  1
2021-01-08  1
2021-01-28  1
Assuming the input is sorted by date, you can use a GroupBy.rolling approach:
# only required if date is not datetime type
df['date'] = pd.to_datetime(df['date'])

(df.assign(count=1)
   .set_index('date')
   .groupby('id')
   .rolling('14d')['count'].sum()
   .sort_index(level='date').reset_index()  # optional if order is not important
)
Output:
id date count
0 1 2021-01-01 1.0
1 1 2021-01-04 2.0
2 2 2021-01-05 1.0
3 2 2021-01-06 2.0
4 1 2021-01-07 3.0
5 1 2021-01-08 4.0
6 1 2021-01-28 1.0
I am not sure whether this is the best idea or not, but the code below is what I have come up with:
from datetime import timedelta

df["date"] = pd.to_datetime(df["date"])

newColumn = []
for index, row in df.iterrows():
    endDate = row["date"]
    startDate = endDate - timedelta(days=14)
    rowId = row["id"]  # renamed from "id" to avoid shadowing the built-in
    summation = df[(df["date"] >= startDate) & (df["date"] <= endDate) & (df["id"] == rowId)]["id"].count()
    newColumn.append(summation)

df["check_column"] = newColumn
df
Output
   date                 id  check_column
0  2021-01-01 00:00:00   1             1
1  2021-01-04 00:00:00   1             2
2  2021-01-05 00:00:00   2             1
3  2021-01-06 00:00:00   2             2
4  2021-01-07 00:00:00   1             3
5  2021-01-08 00:00:00   1             4
6  2021-01-28 00:00:00   1             1
Explanation
In this approach, I used iterrows to loop over the dataframe's rows, and timedelta to subtract 14 days from each row's date.
Related
Given this dataframe df:
date type target
2021-01-01 0 5
2021-01-01 0 6
2021-01-01 1 4
2021-01-01 1 2
2021-01-02 0 5
2021-01-02 1 3
2021-01-02 1 7
2021-01-02 0 1
2021-01-03 0 2
2021-01-03 1 5
I want to create a new column that contains yesterday's target mean by type.
For example, for the 5th row (date=2021-01-02, type=0) the new column's value would be 5.5, as the mean of the target for the previous day, 2021-01-01 for type=0 is (5+6)/2.
I can easily obtain the mean of target grouping by date and type as:
means = df.groupby(['date', 'type'])['target'].mean()
But I don't know how to create a new column on the original dataframe with the desired data, which should look as follows:
date type target mean
2021-01-01 0 5 NaN (or null or whatever)
2021-01-01 0 6 NaN
2021-01-01 1 4 NaN
2021-01-01 1 2 NaN
2021-01-02 0 5 5.5
2021-01-02 1 3 3
2021-01-02 1 7 3
2021-01-02 0 2 5.5
2021-01-03 0 2 3.5
2021-01-03 1 5 5
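For reference, a minimal sketch that builds the example dataframe from this question (column names as above):

import pandas as pd

df = pd.DataFrame({
    'date':   ['2021-01-01'] * 4 + ['2021-01-02'] * 4 + ['2021-01-03'] * 2,
    'type':   [0, 0, 1, 1, 0, 1, 1, 0, 0, 1],
    'target': [5, 6, 4, 2, 5, 3, 7, 1, 2, 5],
})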
Ensure your date column is datetime, and add a temporary column to df holding the date of the day before:
df['date'] = pd.to_datetime(df['date'])
df['yesterday'] = df['date'] - pd.Timedelta('1 day')
Then use your means groupby, with as_index=False, and left merge that onto the original df on yesterday/date and type columns, and select the desired columns:
means = df.groupby(['date', 'type'], as_index=False)['target'].mean()
df.merge(means, left_on=['yesterday', 'type'], right_on=['date', 'type'],
how='left', suffixes=[None, ' mean'])[['date', 'type', 'target', 'target mean']]
Output:
date type target target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 1 5.5
8 2021-01-03 0 2 3.0
9 2021-01-03 1 5 5.0
The idea is to add one day to the first level of the MultiIndex Series with Timedelta, so the new column can then be added with DataFrame.join:
df['date'] = pd.to_datetime(df['date'])
s1 = df.groupby(['date', 'type'])['target'].mean()
s2 = s1.rename(index=lambda x: x + pd.Timedelta(days=1), level=0)
df = df.join(s2.rename('mean'), on=['date','type'])
print (df)
date type target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 1 5.5
8 2021-01-03 0 2 3.0
9 2021-01-03 1 5 5.0
Another solution:
df['date'] = pd.to_datetime(df['date'])
s1 = df.groupby([df['date'] + pd.Timedelta(days=1), 'type'])['target'].mean()
df = df.join(s1.rename('mean'), on=['date','type'])
print (df)
date type target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 1 5.5
8 2021-01-03 0 2 3.0
9 2021-01-03 1 5 5.0
A small edit to @Emi OB's answer:
means = df.groupby(["date", "type"], as_index=False)["target"].mean()
means["mean"] = means.pop("target").shift(2)
df = df.merge(means, how="left", on=["date", "type"])
date type target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 2 5.5
8 2021-01-03 0 2 3.5
9 2021-01-03 1 5 5.0
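Note that shift(2) only lines up correctly because means has exactly one row per (date, type) pair and every date contains both types. A small sketch (my addition) that derives the shift from the number of types instead of hard-coding 2, under that same assumption:

n_types = df['type'].nunique()  # assumes every date has one mean row per type

means = df.groupby(['date', 'type'], as_index=False)['target'].mean()
means['mean'] = means.pop('target').shift(n_types)
df = df.merge(means, how='left', on=['date', 'type'])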
I have data that looks like this: (assume start and end are date times)
id  start  end
1   01-01  01-02
1   01-03  01-05
1   01-04  01-07
1   01-06  NaT
1   01-07  NaT
I want to get a data frame that includes all dates, with a 'cumulative sum' that only counts the rows whose start–end range the date falls in.
dates  count
01-01      1
01-02      0
01-03      1
01-04      2
01-05      1
01-06      2
01-07      3
One idea I thought of was to use cumcount on the start dates and a 'reverse cumcount' that decreases the count using the end dates, but I am having trouble wrapping my head around how to do this in pandas, and I'm wondering whether there's a more elegant solution.
Here are two options. First, consider this data with only one id; note that your start and end columns must be datetime.
d = {'id': [1, 1, 1, 1, 1],
'start': [pd.Timestamp('2021-01-01'), pd.Timestamp('2021-01-03'),
pd.Timestamp('2021-01-04'), pd.Timestamp('2021-01-06'),
pd.Timestamp('2021-01-07')],
'end': [pd.Timestamp('2021-01-02'), pd.Timestamp('2021-01-05'),
pd.Timestamp('2021-01-07'), pd.NaT, pd.NaT]}
df = pd.DataFrame(d)
To get your result, you can subtract the get_dummies of end from the get_dummies of start, sum in case several ranges start and/or end on the same date, take the cumulative sum along the dates, and reindex to get all the dates between the min and max available dates. Wrap it in a function:
def dates_cc(df_):
    return (
        pd.get_dummies(df_['start'], dtype=int)  # dtype=int so the subtraction also works on recent pandas
        .sub(pd.get_dummies(df_['end'], dtype=int), fill_value=0)
        .sum()       # +1 per start, -1 per end, totalled per date
        .cumsum()
        .to_frame(name='count')
        .reindex(pd.date_range(df_['start'].min(), df_['end'].max()), method='ffill')
        .rename_axis('dates')
    )
Now you can apply this function to your dataframe
res = dates_cc(df).reset_index()
print(res)
# dates count
# 0 2021-01-01 1.0
# 1 2021-01-02 0.0
# 2 2021-01-03 1.0
# 3 2021-01-04 2.0
# 4 2021-01-05 1.0
# 5 2021-01-06 2.0
# 6 2021-01-07 2.0
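For reference, the same single-id result can also be sketched with value_counts (+1 on each start date, -1 on each end date, then a cumulative sum); this is my addition, not part of the original answer:

deltas = df['start'].value_counts().sub(df['end'].value_counts(), fill_value=0)
counts = (deltas.sort_index()
                .cumsum()
                .reindex(pd.date_range(df['start'].min(), df['end'].max()), method='ffill')
                .rename_axis('dates')
                .rename('count'))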
Now, if you have several ids, like
df1 = df.assign(id=[1,1,2,2,2])
print(df1)
# id start end
# 0 1 2021-01-01 2021-01-02
# 1 1 2021-01-03 2021-01-05
# 2 2 2021-01-04 2021-01-07
# 3 2 2021-01-06 NaT
# 4 2 2021-01-07 NaT
then you can use the above function like this:
res1 = df1.groupby('id').apply(dates_cc).reset_index()
print(res1)
# id dates count
# 0 1 2021-01-01 1.0
# 1 1 2021-01-02 0.0
# 2 1 2021-01-03 1.0
# 3 1 2021-01-04 1.0
# 4 1 2021-01-05 0.0
# 5 2 2021-01-04 1.0
# 6 2 2021-01-05 1.0
# 7 2 2021-01-06 2.0
# 8 2 2021-01-07 2.0
That said, a more straightforward possibility is crosstab, which creates a row per id; the rest is roughly the same manipulation.
res2 = (
pd.crosstab(index=df1['id'], columns=df1['start'])
.sub(pd.crosstab(index=df1['id'], columns=df1['end']), fill_value=0)
.reindex(columns=pd.date_range(df1['start'].min(), df1['end'].max()), fill_value=0)
.rename_axis(columns='dates')
.cumsum(axis=1)
.stack()
.reset_index(name='count')
)
print(res2)
# id dates count
# 0 1 2021-01-01 1.0
# 1 1 2021-01-02 0.0
# 2 1 2021-01-03 1.0
# 3 1 2021-01-04 1.0
# 4 1 2021-01-05 0.0
# 5 1 2021-01-06 0.0
# 6 1 2021-01-07 0.0
# 7 2 2021-01-01 0.0
# 8 2 2021-01-02 0.0
# 9 2 2021-01-03 0.0
# 10 2 2021-01-04 1.0
# 11 2 2021-01-05 1.0
# 12 2 2021-01-06 2.0
# 13 2 2021-01-07 2.0
The main difference between the two options is that this one creates extra dates for each id: for example, 2021-01-01 belongs to id=1 but not id=2, yet with this version you also get that date for id=2, whereas with the groupby version it is not taken into account.
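If those extra dates are unwanted, one possible follow-up (my addition) is to trim each id back to its own date range afterwards:

bounds = df1.groupby('id').agg(lo=('start', 'min'), hi=('end', 'max')).reset_index()
res3 = res2.merge(bounds, on='id')
res3 = res3[res3['dates'].between(res3['lo'], res3['hi'])].drop(columns=['lo', 'hi'])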
I have two dataframes:
daily = pd.DataFrame({'Date': pd.date_range(start="2021-01-01",end="2021-04-29")})
pc21 = pd.DataFrame({'Date': ["2021-01-21", "2021-03-11", "2021-04-22"]})
pc21['Date'] = pd.to_datetime(pc21['Date'])
What I want to do is the following: for every date in pc21 that is also in daily, I want a new column in daily equal to 1 for the 8 days starting on that date and 0 otherwise.
This is an example of a desired output:
# 2021-01-21 is in both dataframes, so I want a new column in 'daily' that looks like this:
Date newcol
.
.
.
2021-01-20 0
2021-01-21 1
2021-01-22 1
2021-01-23 1
2021-01-24 1
2021-01-25 1
2021-01-26 1
2021-01-27 1
2021-01-28 1
2021-01-29 0
.
.
.
Can anyone help me achieve this?
Thanks!
You can try the following approach:
res = (daily
       .merge(pd.concat([pd.date_range(d, freq="D", periods=8).to_frame(name="Date")
                         for d in pc21["Date"]]),
              how="left", indicator=True)
       .replace({"both": 1, "left_only": 0})
       .rename(columns={"_merge": "newcol"}))
Result:
In [15]: res
Out[15]:
Date newcol
0 2021-01-01 0
1 2021-01-02 0
2 2021-01-03 0
3 2021-01-04 0
4 2021-01-05 0
.. ... ...
114 2021-04-25 1
115 2021-04-26 1
116 2021-04-27 1
117 2021-04-28 1
118 2021-04-29 1
[119 rows x 2 columns]
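Depending on your pandas version, newcol may come back as a categorical or object column, since it originates from the merge indicator; if needed it can be cast explicitly (my addition):

res["newcol"] = res["newcol"].astype(int)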
daily['value'] = 0
pc21['value'] = 1

daily = (pd.merge(daily, pc21, on='Date', how='left')
           .rename(columns={'value_y': 'value'})
           .drop(columns='value_x')
           .ffill(limit=7)   # propagate each 1 for the following 7 days
           .fillna(0))

pc21 = pc21.drop(columns='value')  # restore pc21
Output Subset
daily.query('value.eq(1)')
Date value
20 2021-01-21 1.0
21 2021-01-22 1.0
22 2021-01-23 1.0
23 2021-01-24 1.0
24 2021-01-25 1.0
25 2021-01-26 1.0
26 2021-01-27 1.0
27 2021-01-28 1.0
69 2021-03-11 1.0
daily["new_col"] = np.where(daily.Date.isin(pc21.Date), 1, np.nan)
daily["new_col"] = daily["new_col"].fillna(method="ffill", limit=7).fillna(0)
We generate the new column first: if a Date in daily is in the Date column of pc21, put 1, otherwise put NaN.
Then forward fill that column with a limit of 7 so that we get 8 consecutive 1s.
Lastly, fill the remaining NaNs with 0.
(You can put an astype(int) at the end to have integers, as shown below.)
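For example, the optional cast (assuming the column name used above):

daily["new_col"] = daily["new_col"].astype(int)  # 0.0/1.0 floats -> integers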
Hello, I want to aggregate (sum) by week. Note that the date column is per day.
Does anyone know how to do this?
Thank you kindly,
df = pd.DataFrame({'date': ['2021-01-01', '2021-01-02', '2021-01-03',
                            '2021-01-04', '2021-01-05', '2021-01-06',
                            '2021-01-07', '2021-01-08', '2021-01-09'],
                   'revenue': [5, 3, 2, 10, 12, 2, 1, 0, 6]})
df
date revenue
0 2021-01-01 5
1 2021-01-02 3
2 2021-01-03 2
3 2021-01-04 10
4 2021-01-05 12
5 2021-01-06 2
6 2021-01-07 1
7 2021-01-08 0
8 2021-01-09 6
Expected output:
2021-01-04    31
Use DataFrame.resample with an aggregate sum, but it is necessary to change the default closed side to left, together with the label parameter:
df['date'] = pd.to_datetime(df['date'])
df1 = df.resample('W-Mon', on='date', closed='left', label='left').sum()
print (df1)
revenue
date
2020-12-28 10
2021-01-04 31
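An equivalent formulation (a sketch, not from the original answer) uses pd.Grouper inside groupby with the same closed/label settings:

df1 = (df.groupby(pd.Grouper(key='date', freq='W-MON', closed='left', label='left'))['revenue']
         .sum())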
I have the following Pandas dataframe and I want to drop, for each customer, the rows where the difference between dates is less than 6 months. For example, for the customer with ID 1 I want to keep only these dates: 2017-07-01, 2018-01-01, 2018-08-01.
Customer_ID Date
1 2017-07-01
1 2017-08-01
1 2017-09-01
1 2017-10-01
1 2017-11-01
1 2017-12-01
1 2018-01-01
1 2018-02-01
1 2018-03-01
1 2018-04-01
1 2018-06-01
1 2018-08-01
2 2018-11-01
2 2019-02-01
2 2019-03-01
2 2019-05-01
2 2020-02-01
2 2020-05-01
Define the following function to process each group of rows (for each customer):
def selDates(grp):
    res = []
    while grp.size > 0:
        stRow = grp.iloc[0]
        res.append(stRow)
        # keep only the rows that are at least 6 months after the row just accepted
        grp = grp[grp.Date >= stRow.Date + pd.DateOffset(months=6)]
    return pd.DataFrame(res)
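If Date is stored as strings, convert it to datetime first so the DateOffset comparison inside selDates works (a small addition, not part of the original answer):

df['Date'] = pd.to_datetime(df['Date'])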
Then apply this function to each group:
result = df.groupby('Customer_ID', group_keys=False).apply(selDates)
The result, for your data sample, is:
Customer_ID Date
0 1 2017-07-01
6 1 2018-01-01
11 1 2018-08-01
12 2 2018-11-01
15 2 2019-05-01
16 2 2020-02-01