Aggregate a daily date column by week - Python

Hello, I want to aggregate (sum) the revenue by week. Note that the date column is daily.
Does anyone know how to do this?
Thank you,
import pandas as pd

df = pd.DataFrame({'date': ['2021-01-01', '2021-01-02', '2021-01-03',
                            '2021-01-04', '2021-01-05', '2021-01-06',
                            '2021-01-07', '2021-01-08', '2021-01-09'],
                   'revenue': [5, 3, 2, 10, 12, 2, 1, 0, 6]})
df
date revenue
0 2021-01-01 5
1 2021-01-02 3
2 2021-01-03 2
3 2021-01-04 10
4 2021-01-05 12
5 2021-01-06 2
6 2021-01-07 1
7 2021-01-08 0
8 2021-01-09 6
Expected output:
2021-01-04    31

Use DataFrame.resample with a sum aggregation; it is also necessary to change the default closed side to left, together with the label parameter:
df['date'] = pd.to_datetime(df['date'])
df1 = df.resample('W-MON', on='date', closed='left', label='left').sum()
print(df1)
revenue
date
2020-12-28 10
2021-01-04 31
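The same weekly sum can also be written with groupby and pd.Grouper, which accepts the same freq/closed/label parameters; a minimal sketch, assuming the Monday-anchored week above matches your week definition:
df['date'] = pd.to_datetime(df['date'])
df1 = df.groupby(pd.Grouper(key='date', freq='W-MON', closed='left', label='left'))['revenue'].sum()
print(df1)
This produces the same two weekly totals (10 and 31) as the resample call above.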

Related

New column for quarter of year from datetime col

I have a date column as below:
date
2019-05-11
2019-11-11
2020-03-01
2021-02-18
How can I create a new column that is the same format but by quarter?
Expected output
date | quarter
2019-05-11 2019-04-01
2019-11-11 2019-10-01
2020-03-01 2020-01-01
2021-02-18 2021-01-01
Thanks
You can use pandas.PeriodIndex:
df['date'] = pd.to_datetime(df['date'])
df['quarter'] = pd.PeriodIndex(df['date'].dt.to_period('Q'), freq='Q').to_timestamp()
# Output :
print(df)
date quarter
0 2019-05-11 2019-04-01
1 2019-11-11 2019-10-01
2 2020-03-01 2020-01-01
3 2021-02-18 2021-01-01
Steps:
Convert your date column to datetime type if it is not already.
Convert the dates to quarterly periods with dt.to_period or with PeriodIndex.
Convert the resulting quarter periods to timestamps with to_timestamp to get the starting date of each quarter.
Source Code
import pandas as pd
df = pd.DataFrame({"Dates": pd.date_range("01-01-2022", periods=30, freq="24d")})
df["Quarters"] = df["Dates"].dt.to_period("Q").dt.to_timestamp()
print(df.sample(10))
OUTPUT
Dates Quarters
19 2023-04-02 2023-04-01
29 2023-11-28 2023-10-01
26 2023-09-17 2023-07-01
1 2022-01-25 2022-01-01
25 2023-08-24 2023-07-01
22 2023-06-13 2023-04-01
6 2022-05-25 2022-04-01
18 2023-03-09 2023-01-01
12 2022-10-16 2022-10-01
15 2022-12-27 2022-10-01
In this case, a quarter always stays within the same year and always starts on day 1, so all there is to calculate is the month.
Since a quarter is 3 months (12 / 4), the quarter start months are 1, 4, 7 and 10.
You can use integer division (//) to achieve this:
n = month
quarter = ((n - 1) // 3) * 3 + 1
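Applied to the question's data, a minimal sketch of that arithmetic (the date column name is assumed from the question):
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2019-05-11', '2019-11-11', '2020-03-01', '2021-02-18'])})
# quarter start month from the formula above: ((n - 1) // 3) * 3 + 1
quarter_month = (df['date'].dt.month - 1) // 3 * 3 + 1
# rebuild a date on day 1 of that month in the same year
df['quarter'] = pd.to_datetime(df['date'].dt.year.astype(str) + '-' + quarter_month.astype(str) + '-01')
print(df)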

Count number of occurrences of a certain value in the past 14 days

I have a pandas dataframe with a date column and an id column. For each row, I would like to return the number of occurrences of that row's id in the 14 days up to and including that row's date. That is, for the data below I would like to return "1, 2, 1, 2, 3, 4, 1". How can I do this? Performance is important since the dataframe has a length of around 200,000 rows. Thanks!
date         id
2021-01-01    1
2021-01-04    1
2021-01-05    2
2021-01-06    2
2021-01-07    1
2021-01-08    1
2021-01-28    1
Assuming the input is sorted by date, you can use a GroupBy.rolling approach:
# only required if date is not datetime type
df['date'] = pd.to_datetime(df['date'])
(df.assign(count=1)
   .set_index('date')
   .groupby('id')
   .rolling('14d')['count'].sum()
   .sort_index(level='date')
   .reset_index()  # optional if order is not important
)
output:
id date count
0 1 2021-01-01 1.0
1 1 2021-01-04 2.0
2 2 2021-01-05 1.0
3 2 2021-01-06 2.0
4 1 2021-01-07 3.0
5 1 2021-01-08 4.0
6 1 2021-01-28 1.0
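If you want the plain integer counts from the question ("1, 2, 1, 2, 3, 4, 1") instead of floats, an .astype(int) at the end of the same chain works; a minimal sketch:
counts = (df.assign(count=1)
            .set_index('date')
            .groupby('id')
            .rolling('14d')['count'].sum()
            .astype(int)
            .sort_index(level='date')
            .reset_index(drop=True))
print(counts.tolist())  # [1, 2, 1, 2, 3, 4, 1]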
I am not sure whether this is the best idea or not, but the code below is what I have come up with:
from datetime import timedelta

df["date"] = pd.to_datetime(df["date"])
newColumn = []
for index, row in df.iterrows():
    endDate = row["date"]
    startDate = endDate - timedelta(days=14)
    id = row["id"]
    summation = df[(df["date"] >= startDate) & (df["date"] <= endDate) & (df["id"] == id)]["id"].count()
    newColumn.append(summation)
df["check_column"] = newColumn
df
Output
                 date  id  check_column
0 2021-01-01 00:00:00   1             1
1 2021-01-04 00:00:00   1             2
2 2021-01-05 00:00:00   2             1
3 2021-01-06 00:00:00   2             2
4 2021-01-07 00:00:00   1             3
5 2021-01-08 00:00:00   1             4
6 2021-01-28 00:00:00   1             1
Explanation
In this approach, I have used iterrows in order to loop over the dataframe's rows. Additionally, I have used timedelta in order to subtract 14 days from the date column.

Selecting dates with .gt()

I am trying to select the dates where the percentage change is over 1%. To do this my code is as follows:
df1 has 109 rows × 6 columns
`df1['Close'].pct_change().gt(0.01).index` produces:
DatetimeIndex(['2020-12-31', '2021-01-04', '2021-01-05', '2021-01-06',
'2021-01-07', '2021-01-08', '2021-01-11', '2021-01-12',
'2021-01-13', '2021-01-14',
...
'2021-05-25', '2021-05-26', '2021-05-27', '2021-05-28',
'2021-06-01', '2021-06-02', '2021-06-03', '2021-06-04',
'2021-06-07', '2021-06-08'],
dtype='datetime64[ns]', name='Date', length=109, freq=None)
This is not right because very few dates are over 1%, yet I am still getting the same length of 109 that I would get without .gt().
Could you please advise why it is showing all the dates?
.gt() returns a boolean Series, and .index on that Series is just the full original index. Select only the True values with boolean indexing:
df1.loc[df1["Close"].pct_change().gt(1)].index
>>> df1
Open Close
Date
2021-01-01 5 7
2021-01-02 1 3
2021-01-03 1 2
2021-01-04 10 6
2021-01-05 5 10
2021-01-06 6 9
2021-01-07 8 1
2021-01-08 1 3
2021-01-09 10 5
2021-01-10 7 3
>>> df1.loc[df1["Close"].pct_change().gt(1)].index
DatetimeIndex(['2021-01-04', '2021-01-08'], dtype='datetime64[ns]', name='Date', freq=None)
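With the 1% threshold from the question, the same pattern would be (note the threshold is 0.01, not 1):
mask = df1['Close'].pct_change().gt(0.01)
print(df1.loc[mask].index)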

Creating a DataFrame with a row for each date from date range in other DataFrame

Below is a script for a simplified version of the df in question:
plan_dates = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                           'start_date': ['2021-01-01', '2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05'],
                           'end_date': ['2021-01-04', '2021-01-03', '2021-01-03', '2021-01-06', '2021-01-08']})
plan_dates
id start_date end_date
0 1 2021-01-01 2021-01-04
1 2 2021-01-01 2021-01-03
2 3 2021-01-03 2021-01-03
3 4 2021-01-04 2021-01-06
4 5 2021-01-05 2021-01-08
I would like to create a new DataFrame with a row for each day where the plan is active, for each id.
INTENDED DF:
id active_days
0 1 2021-01-01
1 1 2021-01-02
2 1 2021-01-03
3 1 2021-01-04
4 2 2021-01-01
5 2 2021-01-02
6 2 2021-01-03
7 3 2021-01-03
8 4 2021-01-04
9 4 2021-01-05
10 4 2021-01-06
11 5 2021-01-05
12 5 2021-01-06
13 5 2021-01-07
14 5 2021-01-08
Any help would be greatly appreciated.
Use:
# first part is the same as https://stackoverflow.com/a/66869805/2901002
plan_dates['start_date'] = pd.to_datetime(plan_dates['start_date'])
# add one day so the end date itself is counted
plan_dates['end_date'] = pd.to_datetime(plan_dates['end_date']) + pd.Timedelta(1, unit='d')

# number of active days per plan
s = plan_dates['end_date'].sub(plan_dates['start_date']).dt.days
# repeat each row once per active day
df = plan_dates.loc[plan_dates.index.repeat(s)].copy()
# running day offset within each original row, added to start_date
counter = df.groupby(level=0).cumcount()
df['start_date'] = df['start_date'].add(pd.to_timedelta(counter, unit='d'))
Then remove end_date column, rename and create default index:
df = (df.drop('end_date', axis=1)
.rename(columns={'start_date':'active_days'})
.reset_index(drop=True))
print (df)
id active_days
0 1 2021-01-01
1 1 2021-01-02
2 1 2021-01-03
3 1 2021-01-04
4 2 2021-01-01
5 2 2021-01-02
6 2 2021-01-03
7 3 2021-01-03
8 4 2021-01-04
9 4 2021-01-05
10 4 2021-01-06
11 5 2021-01-05
12 5 2021-01-06
13 5 2021-01-07
14 5 2021-01-08
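For comparison, the same output can also be built with a per-row date_range plus DataFrame.explode; a minimal sketch starting from a fresh copy of the original plan_dates (end dates inclusive, as in the question):
plan_dates = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                           'start_date': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05']),
                           'end_date': pd.to_datetime(['2021-01-04', '2021-01-03', '2021-01-03', '2021-01-06', '2021-01-08'])})
# one inclusive date_range per row, then one row per day
out = (plan_dates.assign(active_days=plan_dates.apply(
            lambda r: pd.date_range(r['start_date'], r['end_date'], freq='D'), axis=1))
                 [['id', 'active_days']]
                 .explode('active_days')
                 .reset_index(drop=True))
print(out)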

Pandas how to copy a column to another dataframe with similar index

I have one Pandas dataframe like the one below. I used pd.to_datetime(df['date']).dt.normalize() to get the date2 column to show just the date and ignore the time; I wasn't sure how to make it just YYYY-MM-DD format.
date2 count compound_mean
0 2021-01-01 00:00:00+00:00 18 0.188411
1 2021-01-02 00:00:00+00:00 9 0.470400
2 2021-01-03 00:00:00+00:00 10 0.008190
3 2021-01-04 00:00:00+00:00 58 0.187510
4 2021-01-05 00:00:00+00:00 150 0.176173
Another dataframe with the following format.
Date Average
2021-01-04 18.200001
2021-01-05 22.080000
2021-01-06 22.250000
2021-01-07 22.260000
2021-01-08 21.629999
I want to have the Average column show up in the first dataframe by matching the dates and then forward-filling any blank values. From 01-01 to 01-03 there will be nothing to forward fill, so I guess it will end up being zero. I'm having trouble finding the right Pandas functions to do this, looking for some guidance. Thank you.
I assume your first dataframe is df1 and your second dataframe is df2.
Firstly, you need to replace the date2 column of df1 with a Date column so that it matches the Date column of df2:
df1['Date'] = pd.to_datetime(df1['date2']).dt.date
You can then remove the date2 column of df1:
df1.drop("date2", axis=1, inplace=True)
You also need to change the type of the Date column of df2 so that it matches the type of df1's Date column:
df2['Date'] = pd.to_datetime(df2['Date']).dt.date
Then make a new dataframe which contains the columns of both dataframes, merged on the Date column:
main_df = pd.merge(df1,df2,on="Date", how="left")
df1['Average'] = main_df['Average']
df1 = pd.DataFrame(df1, columns = ['Date', 'count','compound_mean','Average'])
You can then forward-fill the null values with ffill and fill the remaining (first 3) null values with 0:
df1.fillna(method='ffill', inplace=True)
df1.fillna(0, inplace=True)
Your first dataframe will then look the way you wanted.
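Putting those steps together, a minimal end-to-end sketch (df1 and df2 named as above):
df1['Date'] = pd.to_datetime(df1['date2']).dt.date
df1.drop('date2', axis=1, inplace=True)
df2['Date'] = pd.to_datetime(df2['Date']).dt.date

main_df = pd.merge(df1, df2, on='Date', how='left')
df1['Average'] = main_df['Average']
df1 = pd.DataFrame(df1, columns=['Date', 'count', 'compound_mean', 'Average'])
df1.fillna(method='ffill', inplace=True)
df1.fillna(0, inplace=True)
print(df1)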
Try the following:
>>> df.index = pd.to_datetime(df.date2).dt.date
# If df.date2 is already datetime, use ^ df.index = df.date2.dt.date
>>> df2['Date'] = pd.to_datetime(df2['Date'])
# If df2['Date'] is already datetime, ^ this above line is not needed
>>> df.join(df2.set_index('Date')).fillna(0)
date2 count compound_mean Average
date2
2021-01-01 2021-01-01 00:00:00+00:00 18 0.188411 0.000000
2021-01-02 2021-01-02 00:00:00+00:00 9 0.470400 0.000000
2021-01-03 2021-01-03 00:00:00+00:00 10 0.008190 0.000000
2021-01-04 2021-01-04 00:00:00+00:00 58 0.187510 18.200001
2021-01-05 2021-01-05 00:00:00+00:00 150 0.176173 22.080000
You can perform merge operation as follows:
#Making date of same UTC format from both tables
df1['date2'] = pd.to_datetime(df1['date2'],utc = True)
df2['Date'] = pd.to_datetime(df2['Date'],utc = True)
#Renaming df1 column so that we can map 'Date' from both dataframes
df1.rename(columns={'date2': 'Date'},inplace=True)
#Merge operation
res = pd.merge(df1,df2,on='Date',how='left').fillna(0)
Output:
Date count compound_mean Average
0 2021-01-01 00:00:00+00:00 18 0.188411 0.000000
1 2021-01-02 00:00:00+00:00 9 0.470400 0.000000
2 2021-01-03 00:00:00+00:00 10 0.008190 0.000000
3 2021-01-04 00:00:00+00:00 58 0.187510 18.200001
4 2021-01-05 00:00:00+00:00 150 0.176173 22.080000
