I have a column as below:
date
2019-05-11
2019-11-11
2020-03-01
2021-02-18
How can I create a new column that is the same format but by quarter?
Expected output
date | quarter
2019-05-11 2019-04-01
2019-11-11 2019-10-01
2020-03-01 2020-01-01
2021-02-18 2021-01-01
Thanks
You can use pandas.PeriodIndex:
df['date'] = pd.to_datetime(df['date'])
df['quarter'] = pd.PeriodIndex(df['date'].dt.to_period('Q'), freq='Q').to_timestamp()
# Output :
print(df)
date quarter
0 2019-05-11 2019-04-01
1 2019-11-11 2019-10-01
2 2020-03-01 2020-01-01
3 2021-02-18 2021-01-01
Steps:
Convert your date column to datetime with pd.to_datetime if it is not already datetime
Convert the dates to quarter periods with dt.to_period('Q') or with PeriodIndex
Convert the quarter periods to timestamps with to_timestamp to get the starting date of each quarter
Source Code
import pandas as pd
df = pd.DataFrame({"Dates": pd.date_range("01-01-2022", periods=30, freq="24d")})
df["Quarters"] = df["Dates"].dt.to_period("Q").dt.to_timestamp()
print(df.sample(10))
OUTPUT
Dates Quarters
19 2023-04-02 2023-04-01
29 2023-11-28 2023-10-01
26 2023-09-17 2023-07-01
1 2022-01-25 2022-01-01
25 2023-08-24 2023-07-01
22 2023-06-13 2023-04-01
6 2022-05-25 2022-04-01
18 2023-03-09 2023-01-01
12 2022-10-16 2022-10-01
15 2022-12-27 2022-10-01
In this case, a quarter will always be in the same year and will start on day 1. All there is to calculate is the month.
Since a quarter is 3 months (12 / 4), the quarter start months are 1, 4, 7 and 10.
You can use integer division (//) to achieve this.
n = month
quarter_start_month = ((n - 1) // 3) * 3 + 1
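A minimal sketch of that formula with pandas, assuming the date column from the question above; the quarter start date is then assembled from its year/month/day components with pd.to_datetime:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(
    ["2019-05-11", "2019-11-11", "2020-03-01", "2021-02-18"])})

# quarter start month: ((n - 1) // 3) * 3 + 1 -> 1, 4, 7 or 10
start_month = (df["date"].dt.month - 1) // 3 * 3 + 1

# assemble the quarter start date from year/month/day components
df["quarter"] = pd.to_datetime(
    {"year": df["date"].dt.year, "month": start_month, "day": 1})

print(df)
```

This avoids period objects entirely, at the cost of building the date by hand.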
I have a dataframe as such
Date Value
2022-01-01 10:00:00 7
2022-01-01 10:30:00 5
2022-01-01 11:00:00 3
....
....
2022-02-15 21:00:00 8
I would like to convert it to one row per day and one column per hour, with the Value column filling the cells.
Date 10:00 10:30 11:00 11:30............21:00
2022-01-01 7 5 3 4 11
2022-01-02 8 2 4 4 13
How can I achieve this? I have tried pivot_table but with no success.
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date 10:00:00 10:30:00 11:00:00 21:00:00
Date
2022-01-01 7 5 3 0
2022-02-15 0 0 0 8
To remove Date labels, you can use rename_axis:
for the top Date label: out.rename_axis(columns=None)
for the bottom Date label: out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
You can replace None with any string to rename the axis.
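Putting pivot_table and rename_axis together in one runnable sketch (the three rows here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2022-01-01 10:00:00", "2022-01-01 10:30:00",
             "2022-02-15 21:00:00"],
    "Value": [7, 5, 8],
})
df["Date"] = pd.to_datetime(df["Date"])

# rows: calendar date, columns: time of day, cells: Value (0 where absent)
out = df.pivot_table("Value", df["Date"].dt.date, df["Date"].dt.time,
                     fill_value=0)

# drop both "Date" axis labels in one call
out = out.rename_axis(index=None, columns=None)
print(out)
```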
I have the following pandas dataframe, the duration is espressed in Minutes:
Start Date Event Duration
2021.01.01 00:00 AM 2 540
2021.01.01 9:00 AM 1 180
2021.01.01 12:00 PM 2 20
2021.01.01 12:20 PM 1 1440
2021.01.02 12:20 PM 2 60
2021.01.02 1:20 PM 1 20
I would like to calculate the duration of each event for a single day. The problem is that there are some events, like the one in line 3, that span multiple days.
What I would like to obtain is something like this:
Date Event Duration
2021.01.01 1 880
2021.01.01 2 560
2021.01.02 1 760
2021.01.02 2 60
In general, the sum of all events on a specific day cannot exceed 1440, which is 24 hours * 60 minutes. The events are continuous, so there is always an event running; there are never times without events.
For some weird reason I could not convert your dates right away and needed to collapse the whitespace first. Nonetheless, let’s start by converting your Start Date column to pandas datetimes and setting it as the index:
>>> df['Start Date'] = pd.to_datetime(df['Start Date'].str.replace(r'\s+', ' ', regex=True))
>>> df = df.set_index('Start Date')
>>> df
Event Duration
2021-01-01 00:00:00 2 540
2021-01-01 09:00:00 1 180
2021-01-01 12:00:00 2 20
2021-01-01 12:20:00 1 1440
2021-01-02 12:20:00 2 60
2021-01-02 13:20:00 1 20
We can then compute which splits need to be done, aka timestamps where the day changes but that don’t appear as Start Date, and add those to the index:
>>> splits = pd.date_range(df.index.min().floor(freq='D') + pd.Timedelta(days=1), df.index.max().ceil(freq='D') - pd.Timedelta(days=1), freq='D')
>>> df = df.reindex(df.index.append(splits).drop_duplicates().sort_values())
>>> df
Event Duration
2021-01-01 00:00:00 2.0 540.0
2021-01-01 09:00:00 1.0 180.0
2021-01-01 12:00:00 2.0 20.0
2021-01-01 12:20:00 1.0 1440.0
2021-01-02 00:00:00 NaN NaN
2021-01-02 12:20:00 2.0 60.0
2021-01-02 13:20:00 1.0 20.0
At this point we know it’s the difference between indexes that’s the time we want. Fill in the blanks from Duration, then we can simply group by day/event and sum without any unexpected behaviour:
>>> minutes = df.index.to_series().diff().shift(-1).astype('timedelta64[m]').fillna(df['Duration'])
>>> minutes
2021-01-01 00:00:00 540.0
2021-01-01 09:00:00 180.0
2021-01-01 12:00:00 20.0
2021-01-01 12:20:00 700.0
2021-01-02 00:00:00 740.0
2021-01-02 12:20:00 60.0
2021-01-02 13:20:00 20.0
dtype: float64
>>> minutes.groupby([df.index.date, df['Event'].ffill()]).sum()
Event
2021-01-01 1.0 880.0
2.0 560.0
2021-01-02 1.0 760.0
2.0 60.0
dtype: float64
Note that we also made sure to propagate event ids to the split rows with .ffill().
This solution has the advantage of not generating huge dataframes with 1 entry per minute, and without limits on how many days can be contained in a single Duration value.
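For reference, the steps above as a single runnable sketch; it swaps the timedelta64[m] cast for .dt.total_seconds() (that cast was removed in pandas 2.x), but the logic is otherwise unchanged:

```python
import pandas as pd

df = pd.DataFrame(
    {"Event": [2, 1, 2, 1, 2, 1],
     "Duration": [540, 180, 20, 1440, 60, 20]},
    index=pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 09:00", "2021-01-01 12:00",
        "2021-01-01 12:20", "2021-01-02 12:20", "2021-01-02 13:20"]),
)

# insert a synthetic row at each midnight the data spans but never mentions
splits = pd.date_range(df.index.min().floor("D") + pd.Timedelta(days=1),
                       df.index.max().ceil("D") - pd.Timedelta(days=1),
                       freq="D")
df = df.reindex(df.index.append(splits).drop_duplicates().sort_values())

# minutes until the next row; the final row keeps its own Duration
minutes = df.index.to_series().diff().shift(-1).dt.total_seconds().div(60)
minutes = minutes.fillna(df["Duration"])

# group by day and (forward-filled) event id
out = minutes.groupby([df.index.date, df["Event"].ffill()]).sum()
print(out)
```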
This is not the most elegant solution, but it is a starting point:
# convert to datetime
df['Start Date'] = pd.to_datetime(df['Start Date'])
# calculate the end date
df['End Date'] = df['Start Date'] + df['Duration'].apply(pd.Timedelta, unit='min')
# mask rows where the event crosses midnight (start date differs from end date)
same_day_mask = df['Start Date'].dt.date != df['End Date'].dt.date
# create two new frames
same_day_df = df[~same_day_mask].copy()
not_same_day_df = df[same_day_mask].copy()
# calculate the time it takes to get to midnight the next day
not_same_day_df['day1'] = (not_same_day_df['End Date'].dt.normalize() - not_same_day_df['Start Date']).dt.total_seconds()/60
# Calculate the remaining time from duration
not_same_day_df['day2'] = not_same_day_df['Duration'] - not_same_day_df['day1']
# reassign the first-day portion to Duration
not_same_day_df['Duration'] = not_same_day_df['day1']
not_same_day_df['Event2'] = not_same_day_df['Event']
new = not_same_day_df[['End Date', 'day2', 'Event2']].rename(columns={'End Date': 'Start Date',
'day2': 'Duration',
'Event2': 'Event'})
# concatenate the frames together (DataFrame.append was removed in pandas 2.0)
final_df = pd.concat([same_day_df, not_same_day_df[not_same_day_df.columns[:3]], new])
# groupby and sum
print(final_df.groupby([final_df['Start Date'].dt.normalize(), 'Event'])['Duration'].sum().reset_index())
Start Date Event Duration
0 2021-01-01 1 880.0
1 2021-01-01 2 560.0
2 2021-01-02 1 760.0
3 2021-01-02 2 60.0
You can do it by creating date ranges with pd.date_range, explode and groupby:
df["Start Date"] = pd.to_datetime(df["Start Date"])
df["TimeRange"] = [
pd.date_range(s, periods=m, freq="T")
for s, m in zip(df["Start Date"], df["Duration"])
]
df_out = (
    df.explode("TimeRange")
    .groupby(["Event", pd.Grouper(key="TimeRange", freq="D")])["Event"]
    .count()
    .rename("Duration")
    .reset_index()
)
df_out
Output:
Event TimeRange Duration
0 1 2021-01-01 880
1 1 2021-01-02 760
2 2 2021-01-01 560
3 2 2021-01-02 60
Create a record per minute starting from Start Date, then count the records grouped by event and day.
Have a look at DataFrame.groupby.
For example, you could calculate the sum of all durations starting on a given day like this:
import pandas as pd
import io
df = """
Date Time Event Duration
2021.01.01 00:00 AM 2 540
2021.01.01 9:00 AM 1 180
2021.01.01 12:00 PM 2 20
2021.01.01 12:20 PM 1 1440
2021.01.02 12:20 PM 2 60
2021.01.02 1:20 PM 1 20
"""
df = pd.read_csv(io.StringIO(df), sep=r"\s+")
df.reset_index().groupby(["index", "Event"]).sum()
>>> Duration
>>> index Event
>>> 2021.01.01 1 1620
>>> 2 560
>>> 2021.01.02 1 20
>>> 2 60
Hello guys, I want to aggregate (sum) revenue by week. Note that the date column is per day.
Does anyone know how to do this?
Thank you
kindly,
df = pd.DataFrame({'date':['2021-01-01','2021-01-02',
'2021-01-03','2021-01-04','2021-01-05',
'2021-01-06','2021-01-07','2021-01-08',
'2021-01-09'],
'revenue':[5,3,2,
10,12,2,
1,0,6]})
df
date revenue
0 2021-01-01 5
1 2021-01-02 3
2 2021-01-03 2
3 2021-01-04 10
4 2021-01-05 12
5 2021-01-06 2
6 2021-01-07 1
7 2021-01-08 0
8 2021-01-09 6
Expected output:
2021-01-04    31
Use DataFrame.resample with sum aggregation, but it is necessary to change the default closed side to left, together with the label parameter:
df['date'] = pd.to_datetime(df['date'])
df1 = df.resample('W-Mon', on='date', closed='left', label='left').sum()
print (df1)
revenue
date
2020-12-28 10
2021-01-04 31
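An equivalent sketch without resample, assuming the same frame: label each date with the Monday that starts its week via to_period, then group on that label.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "revenue": [5, 3, 2, 10, 12, 2, 1, 0, 6],
})

# 'W-SUN' weeks run Monday..Sunday, so start_time is each week's Monday
df["week_start"] = df["date"].dt.to_period("W-SUN").dt.start_time
out = df.groupby("week_start")["revenue"].sum()
print(out)
```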
In my CSV file I have a column with date and time with the format 6/1/2019 12:00:00 AM.
My requirement is to remove the time from all rows so that each row has only a date. After this, I have to subtract the base date 1/1/2019 from each row so the row holds only a number of days. Here, for example, if we subtract 6/1/2019 and 1/1/2019 the row will have the value 6.
I tried below code to remove time.
import pandas as pd
df = pd.read_csv('sample.csv', header = 0)
from datetime import datetime,date
df['date'] = pd.to_datetime(df['date']).dt.date
How to subtract date 1/1/2019 from each row in the column and get the days in number using pandas and python datetime library?
Remove the times from the datetimes with Series.dt.floor (converting them to 00:00:00), subtract the base datetime, and finally convert the output timedeltas to days with Series.dt.days:
df = pd.read_csv('sample.csv', header = 0, parse_dates=['date'])
df['days'] = df['date'].dt.floor('d').sub(pd.Timestamp('2019-01-01')).dt.days
Sample:
df = pd.DataFrame({'date': pd.date_range('2019-01-06 12:00:00', periods=10)})
df['days'] = df['date'].dt.floor('d').sub(pd.Timestamp('2019-01-01')).dt.days
print (df)
date days
0 2019-01-06 12:00:00 5
1 2019-01-07 12:00:00 6
2 2019-01-08 12:00:00 7
3 2019-01-09 12:00:00 8
4 2019-01-10 12:00:00 9
5 2019-01-11 12:00:00 10
6 2019-01-12 12:00:00 11
7 2019-01-13 12:00:00 12
8 2019-01-14 12:00:00 13
9 2019-01-15 12:00:00 14
I'm trying to use pandas to group subscribers by subscription type for a given day and get the average price of a subscription type on that day. The data I have resembles:
Sub_Date Sub_Type Price
2011-03-31 00:00:00 12 Month 331.00
2012-04-16 00:00:00 12 Month 334.70
2013-08-06 00:00:00 12 Month 344.34
2014-08-21 00:00:00 12 Month 362.53
2015-08-31 00:00:00 6 Month 289.47
2016-09-03 00:00:00 6 Month 245.57
2013-04-10 00:00:00 4 Month 148.79
2014-03-13 00:00:00 12 Month 348.46
2015-03-15 00:00:00 12 Month 316.86
2011-02-09 00:00:00 12 Month 333.25
2012-03-09 00:00:00 12 Month 333.88
...
2013-04-03 00:00:00 12 Month 318.34
2014-04-15 00:00:00 12 Month 350.73
2015-04-19 00:00:00 6 Month 291.63
2016-04-19 00:00:00 6 Month 247.35
2011-02-14 00:00:00 12 Month 333.25
2012-05-23 00:00:00 12 Month 317.77
2013-05-28 00:00:00 12 Month 328.16
2014-05-31 00:00:00 12 Month 360.02
2011-07-11 00:00:00 12 Month 335.00
...
I'm looking to get something that resembles:
Sub_Date Sub_type Quantity Price
2011-03-31 00:00:00 3 Month 2 125.00
4 Month 0 0.00 # Promo not available this month
6 Month 1 250.78
12 Month 2 334.70
2011-04-01 00:00:00 3 Month 2 125.00
4 Month 2 145.00
6 Month 0 250.78
12 Month 0 334.70
2013-04-02 00:00:00 3 Month 1 125.00
4 Month 3 145.00
6 Month 0 250.78
12 Month 1 334.70
...
2015-06-23 00:00:00 3 Month 4 135.12
4 Month 0 0.00 # Promo not available this month
6 Month 0 272.71
12 Month 3 354.12
...
I'm only able to get the total number of Sub_Types for a given date.
df.Sub_Date.groupby([df.Sub_Date.values.astype('datetime64[D]')]).size()
This is somewhat of a good start, but not exactly what is needed. I've had a look at the groupby documentation on the pandas site but I can't get the output I desire.
I think you need to aggregate by mean and size, then add the missing values with unstack and stack.
Also, if you need to change the order of the Sub_Type level, use an ordered categorical.
#generating all months ('1 Month','2 Month'...'12 Month')
cat = [str(x) + ' Month' for x in range(1, 13)]
df.Sub_Type = df.Sub_Type.astype(pd.CategoricalDtype(categories=cat, ordered=True))
df1 = (df.Price.groupby([df.Sub_Date.values.astype('datetime64[D]'), df.Sub_Type])
          .agg(['mean', 'size'])
          .rename(columns={'size': 'Quantity', 'mean': 'Price'})
          .unstack(fill_value=0)
          .stack())
print (df1)
Price Quantity
Sub_Type
2011-02-09 4 Month 0.00 0
6 Month 0.00 0
12 Month 333.25 1
2011-02-14 4 Month 0.00 0
6 Month 0.00 0
12 Month 333.25 1
2011-03-31 4 Month 0.00 0
6 Month 0.00 0
12 Month 331.00 1