Pandas groupby datetime, getting the count and price - python

I'm trying to use pandas to group subscribers by subscription type for a given day and get the average price of a subscription type on that day. The data I have resembles:
Sub_Date Sub_Type Price
2011-03-31 00:00:00 12 Month 331.00
2012-04-16 00:00:00 12 Month 334.70
2013-08-06 00:00:00 12 Month 344.34
2014-08-21 00:00:00 12 Month 362.53
2015-08-31 00:00:00 6 Month 289.47
2016-09-03 00:00:00 6 Month 245.57
2013-04-10 00:00:00 4 Month 148.79
2014-03-13 00:00:00 12 Month 348.46
2015-03-15 00:00:00 12 Month 316.86
2011-02-09 00:00:00 12 Month 333.25
2012-03-09 00:00:00 12 Month 333.88
...
2013-04-03 00:00:00 12 Month 318.34
2014-04-15 00:00:00 12 Month 350.73
2015-04-19 00:00:00 6 Month 291.63
2016-04-19 00:00:00 6 Month 247.35
2011-02-14 00:00:00 12 Month 333.25
2012-05-23 00:00:00 12 Month 317.77
2013-05-28 00:00:00 12 Month 328.16
2014-05-31 00:00:00 12 Month 360.02
2011-07-11 00:00:00 12 Month 335.00
...
I'm looking to get something that resembles:
Sub_Date Sub_type Quantity Price
2011-03-31 00:00:00 3 Month 2 125.00
4 Month 0 0.00 # Promo not available this month
6 Month 1 250.78
12 Month 2 334.70
2011-04-01 00:00:00 3 Month 2 125.00
4 Month 2 145.00
6 Month 0 250.78
12 Month 0 334.70
2013-04-02 00:00:00 3 Month 1 125.00
4 Month 3 145.00
6 Month 0 250.78
12 Month 1 334.70
...
2015-06-23 00:00:00 3 Month 4 135.12
4 Month 0 0.00 # Promo not available this month
6 Month 0 272.71
12 Month 3 354.12
...
I'm only able to get the total number of Sub_Types for a given date.
df.Sub_Date.groupby([df.Sub_Date.values.astype('datetime64[D]')]).size()
This is somewhat of a good start, but not exactly what is needed. I've had a look at the groupby documentation on the pandas site but I can't get the output I desire.

I think you need to aggregate by mean and size, then add the missing combinations back in with unstack and stack.
Also, if you need to change the order of the Sub_Type level, use an ordered categorical.
# generate all month categories ('1 Month', '2 Month', ..., '12 Month')
cat = [str(x) + ' Month' for x in range(1, 13)]
df.Sub_Type = df.Sub_Type.astype(pd.CategoricalDtype(categories=cat, ordered=True))

df1 = (df.Price.groupby([df.Sub_Date.values.astype('datetime64[D]'), df.Sub_Type])
         .agg(['mean', 'size'])
         .rename(columns={'size': 'Quantity', 'mean': 'Price'})
         .unstack(fill_value=0)
         .stack())
print(df1)
Price Quantity
Sub_Type
2011-02-09 4 Month 0.00 0
6 Month 0.00 0
12 Month 333.25 1
2011-02-14 4 Month 0.00 0
6 Month 0.00 0
12 Month 333.25 1
2011-03-31 4 Month 0.00 0
6 Month 0.00 0
12 Month 331.00 1
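For reference, a minimal sketch of the same idea without the unstack/stack round trip, using observed=False so the categorical grouper emits every Sub_Type for every day. This is an alternative spelling (assuming a reasonably recent pandas), not the code above:
day = pd.to_datetime(df.Sub_Date).dt.normalize()            # strip the time component
df1 = (df.groupby([day, df.Sub_Type], observed=False)['Price']
         .agg(Price='mean', Quantity='size')                # named aggregation
         .fillna({'Price': 0}))                             # empty groups have size 0 and mean NaN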

Related

How to convert year, month, day, hour/minute columns into a single datetime column?

I have the following data format with different columns for year, month, day, and hour_minute (the first two digits are hour and the last two digits are minutes). How do I create a new column in datetime format by combining all of these existing columns?
YEAR  MONTH  DAY  HOUR_MINUTE
2015      1   15         0010
2015      1    2         0020
2015      1   15         0045
2015      1   15         2110
2015     10   21         2359
I have tried the following but with no luck. Thank you for your advice.
df["new_column"]= pd.to_datetime(df[["YEAR", "MONTH", "DAY","HOUR_MINUTE"]])
You need to split the HOUR_MINUTE column into separate HOUR and MINUTE columns:
df["HOUR"] = df["HOUR_MINUTE"].str[0:2]
df["MINUTE"] = df.pop("HOUR_MINUTE").str[2:4]
df["new_column"] = pd.to_datetime(df[["YEAR", "MONTH", "DAY", "HOUR", "MINUTE"]], format="%Y-%m-%d %H:%M")
print(df)
Output:
YEAR MONTH DAY HOUR MINUTE new_column
0 2015 1 15 00 10 2015-01-15 00:10:00
1 2015 1 2 00 20 2015-01-02 00:20:00
2 2015 1 15 00 45 2015-01-15 00:45:00
3 2015 1 15 21 10 2015-01-15 21:10:00
4 2015 10 21 23 59 2015-10-21 23:59:00
You can also apply this across the entire df with a lambda if you only have the YEAR, MONTH, DAY and HOUR_MINUTE columns. Note that MONTH and DAY have to be zero-padded before joining, otherwise the concatenated string does not match the format:
df.apply(lambda row: pd.to_datetime(str(row['YEAR']) + str(row['MONTH']).zfill(2) + str(row['DAY']).zfill(2) + str(row['HOUR_MINUTE']).zfill(4), format="%Y%m%d%H%M"), axis=1)
Out[198]:
0   2015-01-15 00:10:00
1   2015-01-02 00:20:00
2   2015-01-15 00:45:00
3   2015-01-15 21:10:00
4   2015-10-21 23:59:00
dtype: datetime64[ns]
If there are other columns as well, just select the required columns and then apply:
df[['YEAR', 'MONTH', 'DAY', 'HOUR_MINUTE']].apply(lambda row: pd.to_datetime(str(row['YEAR']) + str(row['MONTH']).zfill(2) + str(row['DAY']).zfill(2) + str(row['HOUR_MINUTE']).zfill(4), format="%Y%m%d%H%M"), axis=1)
Out[201]:
0   2015-01-15 00:10:00
1   2015-01-02 00:20:00
2   2015-01-15 00:45:00
3   2015-01-15 21:10:00
4   2015-10-21 23:59:00
dtype: datetime64[ns]
If you want new_column to be assigned to df, then:
df['new_column'] = df[['YEAR', 'MONTH', 'DAY', 'HOUR_MINUTE']].apply(lambda row: pd.to_datetime(str(row['YEAR']) + str(row['MONTH']).zfill(2) + str(row['DAY']).zfill(2) + str(row['HOUR_MINUTE']).zfill(4), format="%Y%m%d%H%M"), axis=1)
df
Out[205]:
   YEAR MONTH DAY HOUR_MINUTE          new_column
0  2015     1  15        0010 2015-01-15 00:10:00
1  2015     1   2        0020 2015-01-02 00:20:00
2  2015     1  15        0045 2015-01-15 00:45:00
3  2015     1  15        2110 2015-01-15 21:10:00
4  2015    10  21        2359 2015-10-21 23:59:00
Suggested script
import pandas as pd
df1 = pd.DataFrame({'YEAR': ['2015', '2015', '2015', '2015', '2015'],
                    'MONTH': ['1', '1', '1', '1', '10'],
                    'DAY': ['15', '2', '15', '15', '21'],
                    'HOUR_MINUTE': ['0010', '0020', '0045', '2110', '2359']})
df1['FMT'] = df1.agg('-'.join(['{0[%s]}'%c for c in df1.columns]).format, axis=1)
df1['FMT'] = pd.to_datetime(df1['FMT'], format='%Y-%m-%d-%H%M')
print(df1)
Output
YEAR MONTH DAY HOUR_MINUTE FMT
0 2015 1 15 0010 2015-01-15 00:10:00
1 2015 1 2 0020 2015-01-02 00:20:00
2 2015 1 15 0045 2015-01-15 00:45:00
3 2015 1 15 2110 2015-01-15 21:10:00
4 2015 10 21 2359 2015-10-21 23:59:00
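If the columns are already strings, as in df1 above, a vectorized variant without apply is also possible; a sketch of the same assembly:
s = (df1['YEAR'] + '-' + df1['MONTH'].str.zfill(2) + '-' + df1['DAY'].str.zfill(2)
     + ' ' + df1['HOUR_MINUTE'].str.zfill(4))
df1['FMT'] = pd.to_datetime(s, format='%Y-%m-%d %H%M')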

New column for quarter of year from datetime col

I have a column as below:
date
2019-05-11
2019-11-11
2020-03-01
2021-02-18
How can I create a new column that is the same format but by quarter?
Expected output
date | quarter
2019-05-11 2019-04-01
2019-11-11 2019-10-01
2020-03-01 2020-01-01
2021-02-18 2021-01-01
Thanks
You can use pandas.PeriodIndex:
df['date'] = pd.to_datetime(df['date'])
df['quarter'] = pd.PeriodIndex(df['date'].dt.to_period('Q'), freq='Q').to_timestamp()
# Output :
print(df)
date quarter
0 2019-05-11 2019-04-01
1 2019-11-11 2019-10-01
2 2020-03-01 2020-01-01
3 2021-02-18 2021-01-01
Steps:
Convert your date column to datetime if it is not already
Convert your dates to quarter periods with dt.to_period or with PeriodIndex
Convert the resulting quarter periods back with to_timestamp to get the starting date of each quarter
Source Code
import pandas as pd
df = pd.DataFrame({"Dates": pd.date_range("01-01-2022", periods=30, freq="24d")})
df["Quarters"] = df["Dates"].dt.to_period("Q").dt.to_timestamp()
print(df.sample(10))
OUTPUT
Dates Quarters
19 2023-04-02 2023-04-01
29 2023-11-28 2023-10-01
26 2023-09-17 2023-07-01
1 2022-01-25 2022-01-01
25 2023-08-24 2023-07-01
22 2023-06-13 2023-04-01
6 2022-05-25 2022-04-01
18 2023-03-09 2023-01-01
12 2022-10-16 2022-10-01
15 2022-12-27 2022-10-01
In this case, a quarter always starts on day 1 of a month in the same year, so all that needs to be calculated is the month.
Since a quarter spans 3 months (12 / 4), the quarter start months are 1, 4, 7 and 10.
You can use integer division (//) to achieve this, as sketched after the formula below.
n = month
quarter = ( (n-1) // 3 ) * 3 + 1
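Put together, a small sketch of that arithmetic with pandas (assuming df['date'] is already a datetime column):
start_month = (df['date'].dt.month - 1) // 3 * 3 + 1
df['quarter'] = pd.to_datetime(pd.DataFrame({'year': df['date'].dt.year,
                                             'month': start_month,
                                             'day': 1}))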

datetime hour component to column python pandas

I have a dataframe as such
Date Value
2022-01-01 10:00:00 7
2022-01-01 10:30:00 5
2022-01-01 11:00:00 3
....
....
2022-02-15 21:00:00 8
I would like to convert it into a format with one row per day and one column per time of day, with the Value column filling the cells.
Date 10:00 10:30 11:00 11:30............21:00
2022-01-01 7 5 3 4 11
2022-01-02 8 2 4 4 13
How can I achieve this? I have tried pivot_table but with no success.
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date 10:00:00 10:30:00 11:00:00 21:00:00
Date
2022-01-01 7 5 3 0
2022-02-15 0 0 0 8
To remove Date labels, you can use rename_axis:
for the top Date label: out.rename_axis(columns=None)
for the bottom Date label: out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
You can replace None with any string to rename the axis instead.
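For example, continuing with the out frame from above:
out = out.rename_axis(index=None, columns=None)   # drop both "Date" labels
print(out)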

Create repetitive rows by iterating between dates in two columns

I have three columns in a data frame:
ID - A001
DoA - 15-03-2014 - Date of Admission
DoL - 17-08-2020 - Date of Leaving
Create three new columns:
Cal_Yr - Calendar Year
Str_Date - Start Date
End_Date - End Date
If the year of admission is earlier than 2015, then
Str_Date = 01-01-2015, else DoA
End_Date = 15-03-2015
I am dividing each year into two parts: one part before the anniversary date (the starting dd-mm of the year) and the other after the anniversary date, so that I can find the weight of both parts; any date before 01-01-2015 should be revalued as 01-01-2015.
I have to design a loop which creates the repeated rows (12 of them in this example) as shown in the figure.
The input table is:
ID   DoA        status  DoL        Duration(years)  fee amt
A23  02-Jan-16  DH      18-Aug-18  2                2345
B23  01-Mar-09  IS      31-Dec-20  11               1000
C23  16-Sep-12  SU      12-Jul-19  7                14565
D23  01-Jun-20  LA      07-Sep-20  0                123
E23  15-Sep-16  IS      31-Dec-20  4                6790
F23  01-Jan-19  IS      31-Dec-20  1                7272
This does what you want. It is not a hard job; like most similar tasks, you just have to take it step by step: "what do I know here?", "what information do I need here?". Note that I have converted the dates to datetime.date objects, assuming you will want to do some analysis based on them.
import pandas as pd
import datetime

data = [
    ["A001", "15-03-2014", "17-08-2020"],
    ["A002", "01-06-2018", "01-06-2020"]
]

rows = []
for id, stdate, endate in data:
    # parse the dd-mm-yyyy strings into date objects
    s = stdate.split('-')
    startdate = datetime.date(int(s[2]), int(s[1]), int(s[0]))
    s = endate.split('-')
    enddate = datetime.date(int(s[2]), int(s[1]), int(s[0]))
    for year in range(startdate.year, enddate.year + 1):
        start1 = datetime.date(year, 1, 1)
        anniv = datetime.date(year, startdate.month, startdate.day)
        end1 = datetime.date(year, 12, 31)
        if year != startdate.year:
            # part of the year before the anniversary date
            rows.append([id, year, start1, anniv])
            if anniv == enddate:
                break
        if year != enddate.year:
            # part of the year after the anniversary date
            rows.append([id, year, anniv, end1])
        elif anniv < enddate:
            # final (partial) year: stop at the leaving date
            rows.append([id, year, anniv, enddate])

df = pd.DataFrame(rows, columns=["ID", "Cal_Yr", "Str_date", "End_date"])
print(df)
Output:
ID Cal_Yr Str_date End_date
0 A001 2014 2014-03-15 2014-12-31
1 A001 2015 2015-01-01 2015-03-15
2 A001 2015 2015-03-15 2015-12-31
3 A001 2016 2016-01-01 2016-03-15
4 A001 2016 2016-03-15 2016-12-31
5 A001 2017 2017-01-01 2017-03-15
6 A001 2017 2017-03-15 2017-12-31
7 A001 2018 2018-01-01 2018-03-15
8 A001 2018 2018-03-15 2018-12-31
9 A001 2019 2019-01-01 2019-03-15
10 A001 2019 2019-03-15 2019-12-31
11 A001 2020 2020-01-01 2020-03-15
12 A001 2020 2020-03-15 2020-08-17
13 A002 2018 2018-06-01 2018-12-31
14 A002 2019 2019-01-01 2019-06-01
15 A002 2019 2019-06-01 2019-12-31
16 A002 2020 2020-01-01 2020-06-01

Convert 3 columns from dataframe to date

I have a dataframe like this (the data was posted as an image).
I want to convert the 'start_year', 'start_month', 'start_day' columns to one date
and the columns 'end_year', 'end_month', 'end_day' to another date.
Is there a way to do that?
Thank you.
Given a dataframe like this:
year month day
0 2019.0 12.0 29.0
1 2020.0 9.0 15.0
2 2018.0 3.0 1.0
You can convert them to date strings using a type cast and str.zfill:
df.apply(lambda x: f'{int(x["year"])}-{str(int(x["month"])).zfill(2)}-{str(int(x["day"])).zfill(2)}', axis=1)
Output:
0 2019-12-29
1 2020-09-15
2 2018-03-01
dtype: object
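If you need actual datetime values rather than strings, one option (a small addition to the above, not part of the original answer) is to wrap the result in pd.to_datetime:
dates = pd.to_datetime(
    df.apply(lambda x: f'{int(x["year"])}-{str(int(x["month"])).zfill(2)}-{str(int(x["day"])).zfill(2)}', axis=1)
)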
Here's an approach:
simulate some data, since your data was posted as an image
use apply against each row, building each date with datetime.datetime()
import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "start_year": np.random.choice(range(2018, 2022), 10),
        "start_month": np.random.choice(range(1, 13), 10),
        "start_day": np.random.choice(range(1, 28), 10),
        "end_year": np.random.choice(range(2018, 2022), 10),
        "end_month": np.random.choice(range(1, 13), 10),
        "end_day": np.random.choice(range(1, 28), 10),
    }
)

# build start_date / end_date for each row from its year/month/day parts
# (pd.concat is used here because Series.append was removed in pandas 2.0)
df = df.apply(
    lambda r: pd.concat([r, pd.Series({f"{startend}_date": dt.datetime(*(r[f"{startend}_{part}"]
                                       for part in ["year", "month", "day"]))
                                       for startend in ["start", "end"]})]),
    axis=1)
df
   start_year  start_month  start_day  end_year  end_month  end_day           start_date             end_date
0        2018            9          6      2020          1        3  2018-09-06 00:00:00  2020-01-03 00:00:00
1        2018           11          6      2020          7        2  2018-11-06 00:00:00  2020-07-02 00:00:00
2        2021            8         13      2020         11        2  2021-08-13 00:00:00  2020-11-02 00:00:00
3        2021            3         15      2021          3        6  2021-03-15 00:00:00  2021-03-06 00:00:00
4        2019            4         13      2021         11        5  2019-04-13 00:00:00  2021-11-05 00:00:00
5        2021            2          5      2018          8       17  2021-02-05 00:00:00  2018-08-17 00:00:00
6        2020            4         19      2020          9       18  2020-04-19 00:00:00  2020-09-18 00:00:00
7        2020            3         27      2020         10       20  2020-03-27 00:00:00  2020-10-20 00:00:00
8        2019           12         23      2018          5       11  2019-12-23 00:00:00  2018-05-11 00:00:00
9        2021            7         18      2018          5       10  2021-07-18 00:00:00  2018-05-10 00:00:00
An interesting feature of the pandas to_datetime function is that instead of
a sequence of strings you can pass it a whole DataFrame.
In this case there is a requirement that such a DataFrame must have columns
named year, month and day. They can also be of float type, like your source
DataFrame sample.
So a quite elegant solution is to:
take a part of the source DataFrame (3 columns with the respective year,
month and day),
rename its columns to year, month and day,
use it as the argument to to_datetime,
save the result as a new column.
To do it, start by defining a lambda function to be used as the rename function below:
colNames = lambda x: x.split('_')[1]
Then just call:
df['Start'] = pd.to_datetime(df.loc[:, 'start_year' : 'start_day']
.rename(columns=colNames))
df['End'] = pd.to_datetime(df.loc[:, 'end_year' : 'end_day']
.rename(columns=colNames))
For a sample of your source DataFrame, the result is:
start_year start_month start_day evidence_method_dating end_year end_month end_day Start End
0 2019.0 12.0 9.0 Historical Observations 2019.0 12.0 9.0 2019-12-09 2019-12-09
1 2019.0 2.0 18.0 Historical Observations 2019.0 7.0 28.0 2019-02-18 2019-07-28
2 2018.0 7.0 3.0 Seismicity 2019.0 8.0 20.0 2018-07-03 2019-08-20
Maybe the next step should be to remove the columns holding the parts of both the "start"
and "end" dates. Your choice; a sketch follows.
Edit
To avoid saving the lambda (anonymous) function under a variable, define
this function as a regular (named) function:
def colNames(x):
return x.split('_')[1]
