Right way to use groupby resample aggregate function - python

I have some data that I'm trying to group by "name" first and then resample by "transaction_date":
transaction_date name revenue
01/01/2020 ADIB 30419
01/01/2020 ADIB 1119372
01/01/2020 ADIB 1272170
01/01/2020 ADIB 43822
01/01/2020 ADIB 24199
The issue I have is that writing the groupby/resample in two different ways returns two different results:
1-- df.groupby("name").resample("M", on="transaction_date").sum()[['revenue']].head(12)
2-- df.groupby("name").resample("M", on="transaction_date").aggregate({'revenue':'sum'}).head(12)
The first method returns the values I'm looking for.
I don't understand why the two methods return different results. Is this a bug?
Result 1
name transaction_date revenue
ADIB 2020-01-31 39170943.0
2020-02-29 48003966.0
2020-03-31 32691641.0
2020-04-30 11979337.0
2020-05-31 35510726.0
2020-06-30 25677857.0
2020-07-31 12437122.0
2020-08-31 4348936.0
2020-09-30 10547188.0
2020-10-31 5287406.0
2020-11-30 4288930.0
2020-12-31 17066105.0
Result 2
name transaction_date revenue
ADIB 2020-01-31 64128331.0
2020-02-29 54450014.0
2020-03-31 45636192.0
2020-04-30 25016777.0
2020-05-31 11941744.0
2020-06-30 15703151.0
2020-07-31 5517526.0
2020-08-31 4092618.0
2020-09-30 4333433.0
2020-10-31 3944117.0
2020-11-30 6528058.0
2020-12-31 5718196.0

Indeed, it's either a bug or an extremely strange behavior. Consider the following data:
input:
date revenue name
0 2020-10-27 0.744045 n_1
1 2020-10-29 0.074852 n_1
2 2020-11-21 0.560182 n_2
3 2020-12-29 0.208616 n_2
4 2020-05-03 0.325044 n_0
gb = df.groupby("name").resample("M", on="date")
gb.aggregate({'revenue':'sum'})
==>
revenue
name date
n_0 2020-12-31 0.325044
n_1 2020-05-31 0.744045
2020-06-30 0.000000
2020-07-31 0.000000
2020-08-31 0.000000
2020-09-30 0.000000
2020-10-31 0.074852
n_2 2020-10-31 0.560182
2020-11-30 0.208616
print(gb.sum()[['revenue']])
==>
revenue
name date
n_0 2020-05-31 0.325044
n_1 2020-10-31 0.818897
n_2 2020-11-30 0.560182
2020-12-31 0.208616
As one can see, it seems that aggregate produces the wrong results. For example, it takes data from Oct and attaches it to May.
Here's an even simpler example:
Data frame:
date revenue name
0 2020-02-24 9 n_1
1 2020-05-12 8 n_2
2 2020-03-28 9 n_2
3 2020-01-14 2 n_0
gb = df.groupby("name").resample("M", on="date")
res1 = gb.sum()[['revenue']]
==>
revenue
name date
n_0 2020-01-31 2
n_1 2020-02-29 9
n_2 2020-03-31 9
2020-04-30 0
2020-05-31 8
res2 = gb.aggregate({'revenue':'sum'})
==>
revenue
name date
n_0 2020-05-31 2
n_1 2020-01-31 9
n_2 2020-02-29 8
2020-03-31 9
I opened a bug about it: https://github.com/pandas-dev/pandas/issues/35173
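Until the bug is fixed, a workaround is to select the column before aggregating (which follows the same code path as method 1 above, the one that produces the correct values), or to resample on a DatetimeIndex instead of the on= keyword. A minimal sketch, assuming a frame shaped like the one in the question:
import pandas as pd

df = pd.DataFrame({
    'transaction_date': pd.to_datetime(['2020-01-01', '2020-01-15', '2020-02-10']),
    'name': ['ADIB', 'ADIB', 'ADIB'],
    'revenue': [30419, 1119372, 43822],
})

# Select the column first, then aggregate; the dates stay aligned.
monthly = df.groupby('name').resample('M', on='transaction_date')['revenue'].sum()

# Or resample on the index rather than via on=.
monthly_alt = df.set_index('transaction_date').groupby('name')['revenue'].resample('M').sum()
print(monthly)
print(monthly_alt)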


How to fill NANs with a specific row of data

I am a new Python user and have a few questions regarding filling NAs in a data frame.
Currently, I have a data frame with a series of dates from 2022-08-01 to 2037-08-01 at a monthly frequency.
However, after 2027-06-01 the pricing data stops, and I would like to extrapolate the values forward to fill out the rest of the dates. Essentially, I would like to take the last 12 months of prices and fill those forward for the rest of the data frame. I am thinking of some type of groupby month with fillna(method='ffill'); however, when I do this it just fills the last value in the df forward. Below is an example of my code.
Above is a picture; you will see that the values stop at 12/1/2023. I wish to fill the previous 12 values forward for the rest of the maturity dates, so all prices from 1/1/2023 to 12/1/2023 will be filled forward for all months.
import pandas as pd

# 25 monthly maturity dates, but only 12 observed prices; the concat
# leaves NaN for the remaining months.
mat = pd.DataFrame(pd.date_range('01/01/2020', '01/01/2022', freq='MS'))
prices = pd.DataFrame([179.06, 174.6, 182.3, 205.59, 204.78, 202.19,
                       216.17, 218.69, 220.73, 223.28, 225.16, 226.31])
example = pd.concat([mat, prices], axis=1)
example.columns = ['maturity', 'price']
Output
0 2020-01-01 179.06
1 2020-02-01 174.6
2 2020-03-01 182.3
3 2020-04-01 205.59
4 2020-05-01 204.78
5 2020-06-01 202.19
6 2020-07-01 216.17
7 2020-08-01 218.69
8 2020-09-01 220.73
9 2020-10-01 223.28
10 2020-11-01 225.16
11 2020-12-01 226.31
12 2021-01-01 NaN
13 2021-02-01 NaN
14 2021-03-01 NaN
15 2021-04-01 NaN
16 2021-05-01 NaN
17 2021-06-01 NaN
18 2021-07-01 NaN
19 2021-08-01 NaN
20 2021-09-01 NaN
21 2021-10-01 NaN
22 2021-11-01 NaN
23 2021-12-01 NaN
24 2022-01-01 NaN
Is this what you're looking for?
# Group the rows by calendar month, then forward-fill within each month
# group, so January 2021 inherits January 2020's price, and so on.
out = example.groupby(example.maturity.dt.month).ffill()
print(out)
Output:
maturity price
0 2020-01-01 179.06
1 2020-02-01 174.6
2 2020-03-01 182.3
3 2020-04-01 205.59
4 2020-05-01 204.78
5 2020-06-01 202.19
6 2020-07-01 216.17
7 2020-08-01 218.69
8 2020-09-01 220.73
9 2020-10-01 223.28
10 2020-11-01 225.16
11 2020-12-01 226.31
12 2021-01-01 179.06
13 2021-02-01 174.6
14 2021-03-01 182.3
15 2021-04-01 205.59
16 2021-05-01 204.78
17 2021-06-01 202.19
18 2021-07-01 216.17
19 2021-08-01 218.69
20 2021-09-01 220.73
21 2021-10-01 223.28
22 2021-11-01 225.16
23 2021-12-01 226.31
24 2022-01-01 179.06
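If you prefer to make the "repeat the last 12 observed prices" logic explicit, here is a sketch of an alternative, assuming the example frame built above: build a month-to-price mapping from the observed rows and fill the gaps from it.
# Map each calendar month (1-12) to its last observed price, then fill
# the NaN rows from that mapping.
observed = example.dropna(subset=['price'])
last_by_month = observed.groupby(observed['maturity'].dt.month)['price'].last()
example['price'] = example['price'].fillna(
    example['maturity'].dt.month.map(last_by_month))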

Group by date and sum value of the top 1% percentile?

date balance
2020-03-31 1000
2020-03-31 900
2020-03-31 800
2020-03-31 700
2020-03-31 200
2020-03-31 100
....
2020-03-31 20
2020-03-31 1
2020-03-31 0.3
....
2020-06-30 3420
2020-06-30 3000
2020-06-30 2000
....
2020-06-30 30
2020-06-30 3
....
2020-09-30 10000
2020-09-30 3000
..
2020-09-30 3
I want to group by date and sum the values that belong to the top 1% (i.e., at or above the 99th percentile).
I used
book2 = book.groupby(['date'])['balance'].agg([lambda x : np.quantile(x, q=0.99), "sum"])
but this is giving me a strange value...
Any idea how to solve this?
Thanks!
Select all values at or above the 99th percentile, then sum them for each date:
df.groupby('date')['balance'].apply(lambda x: x[x >= np.quantile(x, q=0.99)].sum())
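As a quick sanity check, here is a minimal sketch with toy data (the frame and values are hypothetical, just to show the shape of the result):
import numpy as np
import pandas as pd

# For each date, keep only balances at or above that date's 99th
# percentile and sum them.
book = pd.DataFrame({
    'date': ['2020-03-31'] * 4 + ['2020-06-30'] * 4,
    'balance': [1000, 900, 800, 100, 3420, 3000, 2000, 30],
})
top = book.groupby('date')['balance'].apply(
    lambda x: x[x >= np.quantile(x, q=0.99)].sum())
print(top)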

Adding values from another dataframe

I have a df1
Date Open Expiry Entry Strike
0 2020-01-03 12261.10 2020-01-09 12200.0
1 2020-01-10 12271.00 2020-01-16 12200.0
2 2020-01-17 12328.40 2020-01-23 12300.0
3 2020-01-24 12174.55 2020-01-30 12100.0
4 2020-01-31 12100.40 2020-02-06 12100.0
I want to add values from df2:
Date Expiry Type Strike Price Open Close
0 2020-01-03 2020-01-09 CE 13100 0.0 65.85
1 2020-01-03 2020-01-09 CE 13150 0.0 59.40
2 2020-01-03 2020-01-09 CE 13200 0.0 53.55
3 2020-01-03 2020-01-09 CE 13250 0.0 48.15
4 2020-01-03 2020-01-09 CE 13300 0.0 43.25
I want to compare the Date, Expiry and Entry Strike columns of df1 with the Date, Expiry and Strike Price columns of df2, and add the corresponding Open column element to df1 when the condition matches. When I directly compare the columns I get errors like:
ValueError: Can only compare identically-labeled Series objects
Thanks for the help.
Have you tried a simple merge to see if it solves what you want? Do something like this:
pd.merge(df1,df2,how='left', left_on=['Date','Expiry','Entry Strike'], right_on=['Date','Expiry','Strike Price'])
Your output will be as shown below. For this, I modified the first row to match. Otherwise, the data has no matching records.
Date Open_x Expiry ... Strike Price Open_y Close
0 2020-01-03 12261.10 2020-01-09 ... 12200.0 0.0 43.25
1 2020-01-10 12271.00 2020-01-16 ... NaN NaN NaN
2 2020-01-17 12328.40 2020-01-23 ... NaN NaN NaN
3 2020-01-24 12174.55 2020-01-30 ... NaN NaN NaN
4 2020-01-31 12100.40 2020-02-06 ... NaN NaN NaN
You can then delete all columns that you don't want.
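For instance, a sketch of that cleanup step (assuming the df1/df2 frames from the question; the renamed column names are just suggestions):
# Keep df1's columns plus the matched Open price from df2.
merged = pd.merge(df1, df2, how='left',
                  left_on=['Date', 'Expiry', 'Entry Strike'],
                  right_on=['Date', 'Expiry', 'Strike Price'])
result = (merged.drop(columns=['Type', 'Strike Price', 'Close'])
                .rename(columns={'Open_x': 'Open', 'Open_y': 'Entry Open'}))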
You can try with the apply function as well:
import numpy as np

def check(row):
    dt, ex, sp = row['Date'], row['Expiry'], row['Entry Strike']
    # Rows of df2 that match on all three keys.
    match = df2[(df2['Date'] == dt) & (df2['Expiry'] == ex)
                & (df2['Strike Price'] == sp)]
    # First matching Open price, or NaN when nothing matches.
    return match['Open'].iloc[0] if not match.empty else np.nan

df1['new_col'] = df1.apply(check, axis=1)
This will also work; check the output below. I have changed one of the values to match one row.
df1
Date Open Expiry Entry Strike new_col
0 2020-01-03 12261.10 2020-01-09 13100.0 0.0
1 2020-01-10 12271.00 2020-01-16 12200.0 NaN
2 2020-01-17 12328.40 2020-01-23 12300.0 NaN
3 2020-01-24 12174.55 2020-01-30 12100.0 NaN
4 2020-01-31 12100.40 2020-02-06 12100.0 NaN

How to get which semester a day belongs to using pandas.Period

I would like to know an easy way to get which semester a day belongs to, displayed in the format 'YYYY-SX'; e.g. 2018-01-01 -> 2018S1.
I have a date range, and it is pretty easy to do it for quarters:
import pandas as pd
import datetime
start = datetime.datetime(2018, 1, 1)
end = datetime.datetime(2020, 1, 1)
all_days = pd.date_range(start, end, freq='D')
all_quarters = []
for day in all_days:
    all_quarters.append(str(pd.Period(day, freq='Q')))
However, according to the docs, there is no frequency for semesters:
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Period.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
I don't necessarily want to use any specific modules.
Any ideas on how to do it in a clean way?
You can do something like this (note the numpy import):
import numpy as np

df['sem'] = df.date.dt.year.astype(str) + 'S' + np.where(df.date.dt.quarter.gt(2), 2, 1).astype(str)
Note: the date column needs to be a datetime object.
Input
date
0 2019-09-30
1 2019-10-31
2 2019-11-30
3 2019-12-31
4 2020-01-31
5 2020-02-29
6 2020-03-31
7 2020-04-30
8 2020-05-31
9 2020-06-30
Output
date sem
0 2019-09-30 2019S2
1 2019-10-31 2019S2
2 2019-11-30 2019S2
3 2019-12-31 2019S2
4 2020-01-31 2020S1
5 2020-02-29 2020S1
6 2020-03-31 2020S1
7 2020-04-30 2020S1
8 2020-05-31 2020S1
9 2020-06-30 2020S1
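A numpy-free variant, if you prefer plain pandas: the semester number is just (month - 1) // 6 + 1, so (a sketch, assuming the same datetime date column):
# Months 1-6 map to semester 1, months 7-12 to semester 2.
sem = (df['date'].dt.month - 1) // 6 + 1
df['sem'] = df['date'].dt.year.astype(str) + 'S' + sem.astype(str)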

How do I use conditional logic with Datetime columns in Pandas?

I have two datetime columns - ColumnA and ColumnB. I want to create a new column - ColumnC, using conditional logic.
Originally, I created ColumnB from a YearMonth column of dates such as 201907, 201908, etc.
When ColumnA is NaN, I want to choose ColumnB.
Otherwise, I want to choose ColumnA.
Currently, my code below is causing ColumnC to have different formats. I'm not sure how to get rid of all of those 0's. I want the whole column to be YYYY-MM-DD.
ID YearMonth ColumnA ColumnB ColumnC
0 1 201712 2017-12-29 2017-12-31 2017-12-29
1 1 201801 2018-01-31 2018-01-31 2018-01-31
2 1 201802 2018-02-28 2018-02-28 2018-02-28
3 1 201806 2018-06-29 2018-06-30 2018-06-29
4 1 201807 2018-07-31 2018-07-31 2018-07-31
5 1 201808 2018-08-31 2018-08-31 2018-08-31
6 1 201809 2018-09-28 2018-09-30 2018-09-28
7 1 201810 2018-10-31 2018-10-31 2018-10-31
8 1 201811 2018-11-30 2018-11-30 2018-11-30
9 1 201812 2018-12-31 2018-12-31 2018-12-31
10 1 201803 NaN 2018-03-31 1522454400000000000
11 1 201804 NaN 2018-04-30 1525046400000000000
12 1 201805 NaN 2018-05-31 1527724800000000000
13 1 201901 NaN 2019-01-31 1548892800000000000
14 1 201902 NaN 2019-02-28 1551312000000000000
15 1 201903 NaN 2019-03-31 1553990400000000000
16 1 201904 NaN 2019-04-30 1556582400000000000
17 1 201905 NaN 2019-05-31 1559260800000000000
18 1 201906 NaN 2019-06-30 1561852800000000000
19 1 201907 NaN 2019-07-31 1564531200000000000
20 1 201908 NaN 2019-08-31 1567209600000000000
21 1 201909 NaN 2019-09-30 1569801600000000000
df['ColumnB'] = pd.to_datetime(df['YearMonth'], format='%Y%m', errors='coerce').dropna() + pd.offsets.MonthEnd(0)
df['ColumnC'] = np.where(pd.isna(df['ColumnA']), pd.to_datetime(df['ColumnB'], format='%Y%m%d'), df['ColumnA'])
df['ColumnC'] = np.where(df['ColumnA'].isnull(),df['ColumnB'] , df['ColumnA'])
Just figured it out!
df['ColumnC'] = np.where(pd.isna(df['ColumnA']), pd.to_datetime(df['ColumnB']), pd.to_datetime(df['ColumnA']))
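The large integer values in the failed attempt appear to come from np.where coercing datetimes to their nanosecond integer representation when the two branches have mismatched dtypes. A more idiomatic alternative that sidesteps np.where entirely (a sketch, assuming both columns are already datetime64):
# fillna picks ColumnB wherever ColumnA is NaT and keeps the datetime64 dtype.
df['ColumnC'] = df['ColumnA'].fillna(df['ColumnB'])
# combine_first is an equivalent spelling:
# df['ColumnC'] = df['ColumnA'].combine_first(df['ColumnB'])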
