Rolling Median in a Pandas Pivot Table - python

I am trying to calculate a rolling median as an aggregation function on a pandas dataframe. Here is some sample data:
import pandas as pd
import numpy as np
d = {'date': ['2020-01-01','2020-02-01','2020-03-01','2020-01-01','2020-02-01','2020-02-01','2020-03-01','2020-02-01','2020-03-01','2020-03-01','2020-03-01','2020-03-01','2020-03-01'],
     'count': [1,1,1,2,2,3,3,3,4,3,3,3,1],
     'type': ['type1','type2','type3','type1','type3','type1','type2','type2','type2','type3','type1','type2','type1'],
     'salary': [1000,2000,3000,10000,15000,30000,100000,50000,25000,10000,25000,30000,40000]}
df: pd.DataFrame = pd.DataFrame(data=d)
df_pvt: pd.DataFrame = df.pivot_table(index='date',
                                      columns='type',
                                      aggfunc={'salary': np.median})
df_pvt.head(5)
I would like to perform a rolling median on the salaries using pandas' rolling(2).median().
How can I insert this kind of window function into the aggregation function of a pivot table?
My goal is to aggregate a large amount of numeric data by date, take rolling medians of variable window lengths, and report them in the resulting pivot table. I am not entirely sure how to insert this function into aggfunc or the like.
The expected output orders by date ascending, takes all observations associated with each pair of consecutive months, and finds their median.
For type1 we have:
date count type salary
0 2020-01-01 1 type1 1000
3 2020-01-01 2 type1 10000
5 2020-02-01 3 type1 30000
10 2020-03-01 3 type1 25000
12 2020-03-01 1 type1 40000
Thus, for type1 the expected output with rolling(2) would be:
salary
type type1
date
2020-01-01 NaN
2020-02-01 10000.0
2020-03-01 30000.0
The logic follows that the first 2-month rolling window contains the data points 1000, 10000 and 30000, producing a median of 10000.
For 2020-03-01, the rolling window of 2 includes 30000, 25000 and 40000, so the median should be 30000.
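A quick check of that arithmetic with numpy:
import numpy as np
print(np.median([1000, 10000, 30000]))   # 10000.0 (Jan + Feb window)
print(np.median([30000, 25000, 40000]))  # 30000.0 (Feb + Mar window)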

I am not sure it can be done directly with the aggfunc parameter, so a workaround is to append a copy of the data with the date column shifted by a month. Note that this method does not really scale to bigger rolling windows: it works, but you may end up with too much data.
# first convert to datetime
df['date'] = pd.to_datetime(df['date'])
# append a copy of the data shifted forward by a month, then pivot
# (pd.concat is used here because DataFrame.append was removed in pandas 2.0)
res = (
    pd.concat([df, df.assign(date=lambda x: x['date'] + pd.DateOffset(months=1))])
      .pivot_table(index='date', columns='type',
                   aggfunc={'salary': np.median})
      .reindex(df['date'].unique())  # to avoid an extra month
)
print(res)
salary
type type1 type2 type3
date
2020-01-01 5500.0 NaN NaN
2020-02-01 10000.0 26000.0 15000.0
2020-03-01 30000.0 30000.0 10000.0
For the first date, if you want NaN (as a true rolling window would produce), you can run res.loc[res.index.min()] = np.nan afterwards.
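A hedged generalization of the same trick, assuming a window of k months (a sketch, not part of the original answer): concatenate k copies of the data shifted by 0..k-1 months, then pivot. Memory grows k-fold, which is why the approach does not scale to large windows.
k = 2  # window length in months (illustrative assumption)
shifted = [df.assign(date=df['date'] + pd.DateOffset(months=i)) for i in range(k)]
res_k = (
    pd.concat(shifted)
      .pivot_table(index='date', columns='type', aggfunc={'salary': np.median})
      .reindex(df['date'].unique())  # drop the trailing extra months
)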

Related

Create a pivot table where the value is held flat until a certain date

I want to create a pivot table with months as the columns and items as rows. Currently the data is in a table that looks like this:
Item  Balance  Maturity
A     100      1/31/23
B     150      2/28/23
C     200      3/31/23
But I want the data to look like this:
   1/31/23  2/28/23  3/31/23
A  100
B  150      150
C  200      200      200
In Python I have created a date range with frequency 'M'. The idea I am trying to accomplish is: if the date is on or before the Maturity date, repeat the balance.
pivot and bfill:
(df.pivot(index='Item', columns='Maturity', values='Balance')
   .sort_index(axis=1, key=lambda x: pd.to_datetime(x, dayfirst=False))
   .bfill(axis=1)
)
NB. sort_index uses a to_datetime conversion as its key to ensure the correct chronological order of the columns.
Output:
Maturity 1/31/23 2/28/23 3/31/23
Item
A 100.0 NaN NaN
B 150.0 150.0 NaN
C 200.0 200.0 200.0
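If, as the question mentions, you also want a column for every month-end in a fixed range rather than only the maturities present, here is a sketch under that assumption (the range bounds are illustrative):
cols = pd.date_range('2023-01-31', '2023-03-31', freq='M')
out = (df.assign(Maturity=pd.to_datetime(df['Maturity']))  # parse the m/d/yy strings
         .pivot(index='Item', columns='Maturity', values='Balance')
         .reindex(columns=cols)   # one column per month-end in the range
         .bfill(axis=1))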

merging quarterly and monthly data while doing ffill on multiindex

I am trying to merge a quarterly series and a monthly series, and in the process essentially "downsampling" the quarterly series. Both dataframes contain a DATE column, BANK, and the remaining columns are various values either in a monthly or quarterly format. The complication I have had is that it is a multiindex, so if I try:
merged_data=df1.join(df2).reset_index(['DATE', 'BANK_CODE']).ffill()
the forward fill for quarterly data up to the last monthly datapoint is not done for each respective bank as I intended. Could anyone help with this please? Note: I have also tried to resample the quarterly dataframe separately, however I do not know of a way to downsample it to a monthly level until a certain date (should be the latest date in the monthly data).
df2 = df2.set_index(['DATE']).groupby(['BANK']).resample('M')['VALUE'].ffill()
df1:
Date Bank Value1 Value2
2021-06-30 bank 1 2000 7000
2021-07-31 bank 1 3000 2000
2021-06-30 bank 2 6000 9000
df2:
Date Bank Value1 Value2
2021-06-30 bank 1 2000 5000
2021-09-30 bank 1 5000 4000
2021-06-30 bank 2 9000 10000
Here is a mini example.
Using the data provided, assuming df1 is monthly and df2 is quarterly.
Set index and resample your quarterly data to monthly:
# monthly data
x1 = df1.set_index(['Bank', 'Date'])

# quarterly data, resampled back to monthly
x2 = (df2.set_index('Date')
         .groupby('Bank')
         .resample('M')
         .ffill()
         .drop(columns='Bank')
)
Merge both - I assume you want the union of the dates, not just the intersection:
x1.join(x2, lsuffix='_m', rsuffix='_q', how='outer').fillna(0)
Value1_m Value2_m Value1_q Value2_q
Bank Date
bank 1 2021-06-30 2000.0 7000.0 2000 5000
2021-07-31 3000.0 2000.0 2000 5000
2021-08-31 0.0 0.0 2000 5000
2021-09-30 0.0 0.0 5000 4000
bank 2 2021-06-30 6000.0 9000.0 9000 10000
The _m suffixes are the values from df1, _q are from df2. I'm assuming you'll know how to explain or deal with the differences between monthly and quarterly values on the same dates.
As you can see, there is no need to specify the interval; it is derived automatically.
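The question also asks how to carry the quarterly values forward up to the latest date in the monthly data; a hedged sketch of one way to do that (an assumption, not part of the answer above):
# extend each bank's quarterly series to the latest monthly date before joining
last = x1.index.get_level_values('Date').max()

def extend(g):
    g = g.droplevel('Bank')  # keep only the Date index within the group
    full = pd.date_range(g.index.min(), max(g.index.max(), last), freq='M')
    return g.reindex(full).ffill()  # forward-fill the last known quarter

x2_ext = x2.groupby(level='Bank', group_keys=True).apply(extend)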

How do I aggregate rows in a pandas dataframe according to the latest dates in a column?

I have a dataframe containing materials, dates of purchase and purchase prices. I want to filter my dataframe so that I keep only one row per material: the row with the latest purchase date and its corresponding price.
How can I achieve this? I have racked my brains trying to apply aggregation functions to this, but I can't work out how.
Do a multi-column sort and then drop duplicates, keeping the first occurrence.
import pandas as pd
df.sort_values(by=['materials', 'purchase_date'], ascending=[True, False], inplace=True)
df.drop_duplicates(subset=['materials'], keep='first', inplace=True)
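An equivalent chained form without inplace (a style choice, not part of the answer above, using the same assumed column names):
latest = (df.sort_values(['materials', 'purchase_date'], ascending=[True, False])
            .drop_duplicates(subset=['materials'], keep='first'))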
Two steps:
1. sort_values() by material and purchaseDate
2. groupby() material and take the first row
import numpy as np
import pandas as pd

d = pd.date_range("1-apr-2020", "30-oct-2020", freq="W")
df = pd.DataFrame({"material": np.random.choice(list("abcd"), len(d)),
                   "purchaseDate": d,
                   "purchasePrice": np.random.randint(1, 100, len(d))})
df.sort_values(["material", "purchaseDate"], ascending=[1, 0]).groupby("material", as_index=False).first()
output
  material        purchaseDate  purchasePrice
0        a 2020-09-27 00:00:00             85
1        b 2020-10-25 00:00:00             54
2        c 2020-10-11 00:00:00             21
3        d 2020-10-18 00:00:00             45
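A hedged alternative to sorting (not in either answer): groupby().idxmax() selects the row holding each material's latest purchase date directly.
# keep the row with each material's maximum purchaseDate
latest = df.loc[df.groupby("material")["purchaseDate"].idxmax()]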

Pandas replace NaN values with zeros after pivot operation

I have a DataFrame like this:
Year Month Day Rain (mm)
2021 1 1 15
2021 1 2 NaN
2021 1 3 12
And so on (there are multiple years). I have used the pivot_table function to convert the DataFrame into this:
Year 2021 2020 2019 2018 2017
Month Day
1 1 15
2 NaN
3 12
I used:
df = df.pivot_table(index=['Month', 'Day'], columns='Year',
                    values='Rain (mm)', aggfunc='first')
Now I would like to replace all NaN values, and also possible -1 values, with zeros in every column (by columns I mean years), but I have not been able to do so. I have tried:
df = df.fillna(0)
And also:
df.loc[df['Rain (mm)'] == np.nan, 'Rain (mm)'] = 0
But neither works: there is no error message or exception, the dataframe just remains unchanged. What am I doing wrong? Any advice is highly appreciated.
I think the problem is that the NaN values are strings, so fillna cannot replace them; first convert the values to numeric:
df['Rain (mm)'] = pd.to_numeric(df['Rain (mm)'], errors='coerce')
df = df.pivot_table(index=['Month', 'Day'], columns='Year',
                    values='Rain (mm)', aggfunc='first').fillna(0)
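The question also asks to zero out possible -1 values; a hedged one-liner for that (not in the original answer), applied after the fillna:
# replace any -1 sentinel values with 0 as well
df = df.replace(-1, 0)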

Dividing rows for specific columns by date+n in Pandas

I want to divide rows in my dataframe via specific columns.
That is, I have a column named 'ticker', and each ticker has 'date' and 'price' values.
I want to divide the price on date[i+2] by the price on date[i], where i and i+2 just mean the DAY and the DAY + 2 for that ticker. The date column is already in proper datetime format for pandas operations.
The data looks like:
date        ticker  price
2002-01-30  A       20
2002-01-31  A       21
2002-02-01  A       21.4
2002-02-02  A       21.3
.
.
That means I want to select the price based on the ticker, using DAY and DAY + 2 for each ticker to calculate the ratio date[i+2]/date[i].
I've considered using iloc, but I'm not sure how to select specific tickers to do the math on.
Use groupby with transform:
df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
0 NaN
1 NaN
2 1.070000
3 1.014286
Name: price, dtype: float64
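A minimal usage sketch (an addition, assuming the same frame): attach the result as a new column. Inside each ticker group, x.shift(2) aligns every price with the one two rows earlier, so the division yields date[i+2]/date[i].
df['ratio'] = df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))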
