I have a column as below:
date
2019-05-11
2019-11-11
2020-03-01
2021-02-18
How can I create a new column that is the same format but by quarter?
Expected output
date | quarter
2019-05-11 2019-04-01
2019-11-11 2019-10-01
2020-03-01 2020-01-01
2021-02-18 2021-01-01
Thanks
You can use pandas.PeriodIndex:
df['date'] = pd.to_datetime(df['date'])
df['quarter'] = pd.PeriodIndex(df['date'].dt.to_period('Q'), freq='Q').to_timestamp()
# Output :
print(df)
date quarter
0 2019-05-11 2019-04-01
1 2019-11-11 2019-10-01
2 2020-03-01 2020-01-01
3 2021-02-18 2021-01-01
Steps:
Convert your date column to datetime with pd.to_datetime if it is not already datetime
Convert the dates to quarter periods with dt.to_period('Q') or with PeriodIndex
Convert the quarter periods to timestamps with to_timestamp to get the starting date of each quarter
Source Code
import pandas as pd
df = pd.DataFrame({"Dates": pd.date_range("01-01-2022", periods=30, freq="24d")})
df["Quarters"] = df["Dates"].dt.to_period("Q").dt.to_timestamp()
print(df.sample(10))
OUTPUT
Dates Quarters
19 2023-04-02 2023-04-01
29 2023-11-28 2023-10-01
26 2023-09-17 2023-07-01
1 2022-01-25 2022-01-01
25 2023-08-24 2023-07-01
22 2023-06-13 2023-04-01
6 2022-05-25 2022-04-01
18 2023-03-09 2023-01-01
12 2022-10-16 2022-10-01
15 2022-12-27 2022-10-01
In this case, a quarter will always be in the same year and will start on day 1. All there is to calculate is the month.
Since a quarter is 3 months (12 / 4), the quarter start months are 1, 4, 7 and 10.
You can use integer division (//) to achieve this.
n = month
quarter_start_month = ((n - 1) // 3) * 3 + 1
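A minimal sketch of that formula with pandas, assuming the date column from the question above; the quarter start date is then assembled from its year/month/day components with pd.to_datetime:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(
    ["2019-05-11", "2019-11-11", "2020-03-01", "2021-02-18"])})

# quarter start month: ((n - 1) // 3) * 3 + 1 -> 1, 4, 7 or 10
start_month = (df["date"].dt.month - 1) // 3 * 3 + 1

# assemble the quarter start date from year/month/day components
df["quarter"] = pd.to_datetime(
    {"year": df["date"].dt.year, "month": start_month, "day": 1})

print(df)
```

This avoids period objects entirely, at the cost of building the date by hand.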
I have a dataframe as such
Date Value
2022-01-01 10:00:00 7
2022-01-01 10:30:00 5
2022-01-01 11:00:00 3
....
....
2022-02-15 21:00:00 8
I would like to convert it to one row per day and one column per hour, with the Value column filling the cells.
Date 10:00 10:30 11:00 11:30............21:00
2022-01-01 7 5 3 4 11
2022-01-02 8 2 4 4 13
How can I achieve this? I have tried pivot_table but with no success.
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date 10:00:00 10:30:00 11:00:00 21:00:00
Date
2022-01-01 7 5 3 0
2022-02-15 0 0 0 8
To remove Date labels, you can use rename_axis:
for the top Date label: out.rename_axis(columns=None)
for the bottom Date label: out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
You can replace None with any string to rename the axis.
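Putting pivot_table and rename_axis together in one runnable sketch (the three rows here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2022-01-01 10:00:00", "2022-01-01 10:30:00",
             "2022-02-15 21:00:00"],
    "Value": [7, 5, 8],
})
df["Date"] = pd.to_datetime(df["Date"])

# rows: calendar date, columns: time of day, cells: Value (0 where absent)
out = df.pivot_table("Value", df["Date"].dt.date, df["Date"].dt.time,
                     fill_value=0)

# drop both "Date" axis labels in one call
out = out.rename_axis(index=None, columns=None)
print(out)
```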
I have the following pandas dataframe, the duration is espressed in Minutes:
Start Date Event Duration
2021.01.01 00:00 AM 2 540
2021.01.01 9:00 AM 1 180
2021.01.01 12:00 PM 2 20
2021.01.01 12:20 PM 1 1440
2021.01.02 12:20 PM 2 60
2021.01.02 1:20 PM 1 20
I would like to calculate the duration of each event for a single day. The problem is that there are some events, like the one in line 3, that span multiple days.
What I would like to obtain is something like this:
Date Event Duration
2021.01.01 1 880
2021.01.01 2 560
2021.01.02 1 760
2021.01.02 2 60
In general, the sum of all events on a specific day cannot exceed 1440, which is 24 hours * 60 minutes. The events are continuous, so there is always an event running; there are never times without events.
For some weird reason I could not convert your dates right away and needed to collapse the whitespace first. Nonetheless, let’s start by converting your Start Date column to pandas datetimes and setting it as the index:
>>> df['Start Date'] = pd.to_datetime(df['Start Date'].str.replace(r'\s+', ' ', regex=True))
>>> df = df.set_index('Start Date')
>>> df
Event Duration
2021-01-01 00:00:00 2 540
2021-01-01 09:00:00 1 180
2021-01-01 12:00:00 2 20
2021-01-01 12:20:00 1 1440
2021-01-02 12:20:00 2 60
2021-01-02 13:20:00 1 20
We can then compute which splits need to be done, aka timestamps where the day changes but that don’t appear as Start Date, and add those to the index:
>>> splits = pd.date_range(df.index.min().floor(freq='D') + pd.Timedelta(days=1), df.index.max().ceil(freq='D') - pd.Timedelta(days=1), freq='D')
>>> df = df.reindex(df.index.append(splits).drop_duplicates().sort_values())
>>> df
Event Duration
2021-01-01 00:00:00 2.0 540.0
2021-01-01 09:00:00 1.0 180.0
2021-01-01 12:00:00 2.0 20.0
2021-01-01 12:20:00 1.0 1440.0
2021-01-02 00:00:00 NaN NaN
2021-01-02 12:20:00 2.0 60.0
2021-01-02 13:20:00 1.0 20.0
At this point we know it’s the difference between indexes that’s the time we want. Fill in the blanks from Duration, then we can simply group by day/event and sum without any unexpected behaviour:
>>> minutes = df.index.to_series().diff().shift(-1).astype('timedelta64[m]').fillna(df['Duration'])
>>> minutes
2021-01-01 00:00:00 540.0
2021-01-01 09:00:00 180.0
2021-01-01 12:00:00 20.0
2021-01-01 12:20:00 700.0
2021-01-02 00:00:00 740.0
2021-01-02 12:20:00 60.0
2021-01-02 13:20:00 20.0
dtype: float64
>>> minutes.groupby([df.index.date, df['Event'].ffill()]).sum()
Event
2021-01-01 1.0 880.0
2.0 560.0
2021-01-02 1.0 760.0
2.0 60.0
dtype: float64
Note that we also made sure to propagate event ids to the split rows with .ffill().
This solution has the advantage of not generating huge dataframes with 1 entry per minute, and without limits on how many days can be contained in a single Duration value.
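For reference, the steps above as a single runnable sketch; it swaps the timedelta64[m] cast for .dt.total_seconds() (that cast was removed in pandas 2.x), but the logic is otherwise unchanged:

```python
import pandas as pd

df = pd.DataFrame(
    {"Event": [2, 1, 2, 1, 2, 1],
     "Duration": [540, 180, 20, 1440, 60, 20]},
    index=pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 09:00", "2021-01-01 12:00",
        "2021-01-01 12:20", "2021-01-02 12:20", "2021-01-02 13:20"]),
)

# insert a synthetic row at each midnight the data spans but never mentions
splits = pd.date_range(df.index.min().floor("D") + pd.Timedelta(days=1),
                       df.index.max().ceil("D") - pd.Timedelta(days=1),
                       freq="D")
df = df.reindex(df.index.append(splits).drop_duplicates().sort_values())

# minutes until the next row; the final row keeps its own Duration
minutes = df.index.to_series().diff().shift(-1).dt.total_seconds().div(60)
minutes = minutes.fillna(df["Duration"])

# group by day and (forward-filled) event id
out = minutes.groupby([df.index.date, df["Event"].ffill()]).sum()
print(out)
```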
This is not the most elegant solution, but it is a starting point:
# convert to datetime
df['Start Date'] = pd.to_datetime(df['Start Date'])
# calculate the end date
df['End Date'] = df['Start Date'] + df['Duration'].apply(pd.Timedelta, unit='min')
# mask rows where the event crosses midnight (start date differs from end date)
same_day_mask = df['Start Date'].dt.date != df['End Date'].dt.date
# create two new frames
same_day_df = df[~same_day_mask].copy()
not_same_day_df = df[same_day_mask].copy()
# calculate the time it takes to get to midnight the next day
not_same_day_df['day1'] = (not_same_day_df['End Date'].dt.normalize() - not_same_day_df['Start Date']).dt.total_seconds()/60
# Calculate the remaining time from duration
not_same_day_df['day2'] = not_same_day_df['Duration'] - not_same_day_df['day1']
# reassign the first-day portion to Duration
not_same_day_df['Duration'] = not_same_day_df['day1']
not_same_day_df['Event2'] = not_same_day_df['Event']
new = not_same_day_df[['End Date', 'day2', 'Event2']].rename(columns={'End Date': 'Start Date',
'day2': 'Duration',
'Event2': 'Event'})
# concatenate the frames together (DataFrame.append was removed in pandas 2.0)
final_df = pd.concat([same_day_df, not_same_day_df[not_same_day_df.columns[:3]], new])
# groupby and sum
print(final_df.groupby([final_df['Start Date'].dt.normalize(), 'Event'])['Duration'].sum().reset_index())
Start Date Event Duration
0 2021-01-01 1 880.0
1 2021-01-01 2 560.0
2 2021-01-02 1 760.0
3 2021-01-02 2 60.0
You can do it by creating date ranges with pd.date_range, explode and groupby:
df["Start Date"] = pd.to_datetime(df["Start Date"])
df["TimeRange"] = [
pd.date_range(s, periods=m, freq="T")
for s, m in zip(df["Start Date"], df["Duration"])
]
df_out = (
    df.explode("TimeRange")
    .groupby(["Event", pd.Grouper(key="TimeRange", freq="D")])["Event"]
    .count()
    .rename("Duration")
    .reset_index()
)
df_out
Output:
Event TimeRange Duration
0 1 2021-01-01 880
1 1 2021-01-02 760
2 2 2021-01-01 560
3 2 2021-01-02 60
Create a record per minute starting from Start Date, then count the records grouped by event and day.
Have a look at DataFrame.groupby.
For example, you could calculate the sum of all durations starting on a given day like this:
import pandas as pd
import io
df = """
Date Time Event Duration
2021.01.01 00:00 AM 2 540
2021.01.01 9:00 AM 1 180
2021.01.01 12:00 PM 2 20
2021.01.01 12:20 PM 1 1440
2021.01.02 12:20 PM 2 60
2021.01.02 1:20 PM 1 20
"""
df = pd.read_csv(io.StringIO(df), sep=r"\s+")
df.reset_index().groupby(["index", "Event"]).sum()
>>> Duration
>>> index Event
>>> 2021.01.01 1 1620
>>> 2 560
>>> 2021.01.02 1 20
>>> 2 60
Hello guys, I want to aggregate (sum) revenue by week. Note that the date column is per day.
Does anyone know how to do this?
Thank you
kindly,
df = pd.DataFrame({'date':['2021-01-01','2021-01-02',
'2021-01-03','2021-01-04','2021-01-05',
'2021-01-06','2021-01-07','2021-01-08',
'2021-01-09'],
'revenue':[5,3,2,
10,12,2,
1,0,6]})
df
date revenue
0 2021-01-01 5
1 2021-01-02 3
2 2021-01-03 2
3 2021-01-04 10
4 2021-01-05 12
5 2021-01-06 2
6 2021-01-07 1
7 2021-01-08 0
8 2021-01-09 6
Expected output:
2021-01-04    31
Use DataFrame.resample with sum aggregation, but it is necessary to change the default closed side to left, together with the label parameter:
df['date'] = pd.to_datetime(df['date'])
df1 = df.resample('W-Mon', on='date', closed='left', label='left').sum()
print (df1)
revenue
date
2020-12-28 10
2021-01-04 31
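An equivalent sketch without resample, assuming the same frame: label each date with the Monday that starts its week via to_period, then group on that label.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9, freq="D"),
    "revenue": [5, 3, 2, 10, 12, 2, 1, 0, 6],
})

# 'W-SUN' weeks run Monday..Sunday, so start_time is each week's Monday
df["week_start"] = df["date"].dt.to_period("W-SUN").dt.start_time
out = df.groupby("week_start")["revenue"].sum()
print(out)
```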
In my CSV file I have a column with date and time with the format 6/1/2019 12:00:00 AM.
My requirement is to remove the time from all rows so that each row has only a date. After this, I have to subtract the base date 1/1/2019 from each row so the row holds only a number of days. Here, for example, if we subtract 6/1/2019 and 1/1/2019 the row will have the value 6.
I tried below code to remove time.
import pandas as pd
df = pd.read_csv('sample.csv', header = 0)
from datetime import datetime,date
df['date'] = pd.to_datetime(df['date']).dt.date
How to subtract date 1/1/2019 from each row in the column and get the days in number using pandas and python datetime library?
Remove the times from the datetimes with Series.dt.floor (converting them to 00:00:00), subtract the base datetime, and finally convert the output timedeltas to days with Series.dt.days:
df = pd.read_csv('sample.csv', header = 0, parse_dates=['date'])
df['days'] = df['date'].dt.floor('d').sub(pd.Timestamp('2019-01-01')).dt.days
Sample:
df = pd.DataFrame({'date': pd.date_range('2019-01-06 12:00:00', periods=10)})
df['days'] = df['date'].dt.floor('d').sub(pd.Timestamp('2019-01-01')).dt.days
print (df)
date days
0 2019-01-06 12:00:00 5
1 2019-01-07 12:00:00 6
2 2019-01-08 12:00:00 7
3 2019-01-09 12:00:00 8
4 2019-01-10 12:00:00 9
5 2019-01-11 12:00:00 10
6 2019-01-12 12:00:00 11
7 2019-01-13 12:00:00 12
8 2019-01-14 12:00:00 13
9 2019-01-15 12:00:00 14
I'm trying to use pandas to group subscribers by subscription type for a given day and get the average price of a subscription type on that day. The data I have resembles:
Sub_Date Sub_Type Price
2011-03-31 00:00:00 12 Month 331.00
2012-04-16 00:00:00 12 Month 334.70
2013-08-06 00:00:00 12 Month 344.34
2014-08-21 00:00:00 12 Month 362.53
2015-08-31 00:00:00 6 Month 289.47
2016-09-03 00:00:00 6 Month 245.57
2013-04-10 00:00:00 4 Month 148.79
2014-03-13 00:00:00 12 Month 348.46
2015-03-15 00:00:00 12 Month 316.86
2011-02-09 00:00:00 12 Month 333.25
2012-03-09 00:00:00 12 Month 333.88
...
2013-04-03 00:00:00 12 Month 318.34
2014-04-15 00:00:00 12 Month 350.73
2015-04-19 00:00:00 6 Month 291.63
2016-04-19 00:00:00 6 Month 247.35
2011-02-14 00:00:00 12 Month 333.25
2012-05-23 00:00:00 12 Month 317.77
2013-05-28 00:00:00 12 Month 328.16
2014-05-31 00:00:00 12 Month 360.02
2011-07-11 00:00:00 12 Month 335.00
...
I'm looking to get something that resembles:
Sub_Date Sub_type Quantity Price
2011-03-31 00:00:00 3 Month 2 125.00
4 Month 0 0.00 # Promo not available this month
6 Month 1 250.78
12 Month 2 334.70
2011-04-01 00:00:00 3 Month 2 125.00
4 Month 2 145.00
6 Month 0 250.78
12 Month 0 334.70
2013-04-02 00:00:00 3 Month 1 125.00
4 Month 3 145.00
6 Month 0 250.78
12 Month 1 334.70
...
2015-06-23 00:00:00 3 Month 4 135.12
4 Month 0 0.00 # Promo not available this month
6 Month 0 272.71
12 Month 3 354.12
...
I'm only able to get the total number of Sub_Types for a given date.
df.Sub_Date.groupby([df.Sub_Date.values.astype('datetime64[D]')]).size()
This is somewhat of a good start, but not exactly what is needed. I've had a look at the groupby documentation on the pandas site but I can't get the output I desire.
I think you need to aggregate by mean and size, then add the missing values with unstack and stack.
Also, if you need to change the order of the Sub_Type level, use an ordered categorical.
#generating all months ('1 Month','2 Month'...'12 Month')
cat = [str(x) + ' Month' for x in range(1, 13)]
df.Sub_Type = df.Sub_Type.astype(pd.CategoricalDtype(categories=cat, ordered=True))
df1 = (df.Price.groupby([df.Sub_Date.values.astype('datetime64[D]'), df.Sub_Type])
          .agg(['mean', 'size'])
          .rename(columns={'size': 'Quantity', 'mean': 'Price'})
          .unstack(fill_value=0)
          .stack())
print (df1)
Price Quantity
Sub_Type
2011-02-09 4 Month 0.00 0
6 Month 0.00 0
12 Month 333.25 1
2011-02-14 4 Month 0.00 0
6 Month 0.00 0
12 Month 333.25 1
2011-03-31 4 Month 0.00 0
6 Month 0.00 0
12 Month 331.00 1