pandas calculates column value means on groups and means across whole dataframe - python

I have a df, df['period'] = (df['date1'] - df['date2']) / np.timedelta64(1, 'D')
code y_m date1 date2 period
1000 201701 2017-12-10 2017-12-09 1
1000 201701 2017-12-14 2017-12-12 2
1000 201702 2017-12-15 2017-12-13 2
1000 201702 2017-12-17 2017-12-15 2
2000 201701 2017-12-19 2017-12-18 1
2000 201701 2017-12-12 2017-12-10 2
2000 201702 2017-12-11 2017-12-10 1
2000 201702 2017-12-13 2017-12-12 1
2000 201702 2017-12-11 2017-12-10 1
then groupby code and y_m to calculate the average of date1-date2,
df_avg_period = df.groupby(['code', 'y_m'])['period'].mean().reset_index(name='avg_period')
code y_m avg_period
1000 201701 1.5
1000 201702 2
2000 201701 1.5
2000 201702 1
but I like to convert df_avg_period into a matrix that transposes column code to rows and y_m to columns, like
0 1 2 3
0 -1 0 201701 201702
1 0 1.44 1.44 1.4
2 1000 1.75 1.5 2
3 2000 1.20 1.5 1
-1 represents a dummy value that indicates either a value doesn't exist for a specific code/y_m cell or to maintain matrix shape; 0 represents 'all' values, that averages the code or y_m or code and y_m, e.g. cell (1,1) averages the period values for all rows in df; (1,2) averages the period for 201701 across all rows that have this value for y_m in df.
apparently pivot_table cannot give correct results using mean. so I am wondering how to achieve that correctly?

pivot_table with margins=True
piv = df.pivot_table(
index='code', columns='y_m', values='period', aggfunc='mean', margins=True
)
# housekeeping
(piv.reset_index()
.rename_axis(None, 1)
.rename({'code' : -1, 'All' : 0}, axis=1)
.sort_index(axis=1)
)
-1 0 201701 201702
0 1000 1.750000 1.5 2.0
1 2000 1.200000 1.5 1.0
2 All 1.444444 1.5 1.4

Related

DataFrame groupby and divide by group sum

In order to build stock portfolios for a backtest I am trying to get the market capitalization (me) weight of each stock within its portfolio. For test purposes I built the following DataFrame of price and return observations. Every day I am assigning the stocks to quantiles based on price and all stocks in the same quantile that day will be in one portfolio:
d = {'date' : ['202211', '202211', '202211','202211', '202212', '202212', '202212', '202212'],
'price' : [1, 1.2, 1.3, 1.5, 1.7, 2, 1.5, 1],
'shrs' : [100, 100, 100, 100, 100, 100, 100, 100]}
df = pd.DataFrame(data = d)
df.set_index('date', inplace=True)
df.index = pd.to_datetime(df.index, format='%Y%m%d')
df["me"] = df['price'] * df['shrs']
df['rank'] = df.groupby('date')['price'].transform(lambda x: pd.qcut(x, 2, labels=range(1,3), duplicates='drop'))
df
price shrs me rank
date
2022-01-01 1.0 100 100.0 1
2022-01-01 1.2 100 120.0 1
2022-01-01 1.3 100 130.0 2
2022-01-01 1.5 100 150.0 2
2022-01-02 1.7 100 170.0 2
2022-01-02 2.0 100 200.0 2
2022-01-02 1.5 100 150.0 1
2022-01-02 1.0 100 100.0 1
In the next step I am grouping by 'date' and 'rank' and divide each observation's market cap by the sum of the groups market cap in order to obtain the stocks weight in the portfolio:
df['weight'] = df.groupby(['date', 'rank'], group_keys=False).apply(lambda x: x['me'] / x['me'].sum()).sort_index()
print(df)
price shrs me rank weight
date
2022-01-01 1.0 100 100.0 1 0.454545
2022-01-01 1.2 100 120.0 1 0.545455
2022-01-01 1.3 100 130.0 2 0.464286
2022-01-01 1.5 100 150.0 2 0.535714
2022-01-02 1.7 100 170.0 2 0.600000
2022-01-02 2.0 100 200.0 2 0.400000
2022-01-02 1.5 100 150.0 1 0.459459
2022-01-02 1.0 100 100.0 1 0.540541
Now comes the flaw. On my test df this works perfectly fine. However on the real data (DataFrame with shape 160000 x 21) the calculations take endless and I always have to interrupt the Jupyter Kernel at some point. Is there a more efficient way to do this? What am I missing?
Interestingly I am using the same code as some colleagues on similar DataFrames and for them it takes seconds only.
Use GroupBy.transform with sum for new Series and use it for divide me column:
df['weight'] = df['me'].div(df.groupby(['date', 'rank'])['me'].transform('sum'))
It might not be the most elegant solution, but if you run into performance issue you can try to split it into multiple parts, but storing the groupped value of me in a Series and then merge it back
temp = df.groupby(['date', 'rank'], group_keys=False).apply(lambda x: x['me'].sum())
temp = temp.reset_index(name='weight')
df = df.merge(temp, on=['date', 'rank'])
df['weight'] = df['me'] / df['weight']
df.set_index('date', inplace=True)
df
which should lead to the output:
price shrs me rank weight
date
2022-01-01 1.0 100 100.0 1 0.454545
2022-01-01 1.2 100 120.0 1 0.545455
2022-01-01 1.3 100 130.0 2 0.464286
2022-01-01 1.5 100 150.0 2 0.535714
2022-01-02 1.7 100 170.0 2 0.459459
2022-01-02 2.0 100 200.0 2 0.540541
2022-01-02 1.5 100 150.0 1 0.600000
2022-01-02 1.0 100 100.0 1 0.400000

Pandas add missing weeks from range to dataframe

I am computing a DataFrame with weekly amounts and now I need to fill it with missing weeks from a provided date range.
This is how I'm generating the dataframe with the weekly amounts:
df['date'] = pd.to_datetime(df['date']) - timedelta(days=6)
weekly_data: pd.DataFrame = (df
.groupby([pd.Grouper(key='date', freq='W-SUN')])[data_type]
.sum()
.reset_index()
)
Which outputs:
date sum
0 2020-10-11 78
1 2020-10-18 673
If a date range is given as start='2020-08-30' and end='2020-10-30', then I would expect the following dataframe:
date sum
0 2020-08-30 0.0
1 2020-09-06 0.0
2 2020-09-13 0.0
3 2020-09-20 0.0
4 2020-09-27 0.0
5 2020-10-04 0.0
6 2020-10-11 78
7 2020-10-18 673
8 2020-10-25 0.0
So far, I have managed to just add the missing weeks and set the sum to 0, but it also replaces the existing values:
weekly_data = weekly_data.reindex(pd.date_range('2020-08-30', '2020-10-30', freq='W-SUN')).fillna(0)
Which outputs:
date sum
0 2020-08-30 0.0
1 2020-09-06 0.0
2 2020-09-13 0.0
3 2020-09-20 0.0
4 2020-09-27 0.0
5 2020-10-04 0.0
6 2020-10-11 0.0 # should be 78
7 2020-10-18 0.0 # should be 673
8 2020-10-25 0.0
Remove reset_index for DatetimeIndex, because reindex working with index and if RangeIndex get 0 values, because no match:
weekly_data = (df.groupby([pd.Grouper(key='date', freq='W-SUN')])[data_type]
.sum()
)
Then is possible use fill_value=0 parameter and last add reset_index:
r = pd.date_range('2020-08-30', '2020-10-30', freq='W-SUN', name='date')
weekly_data = weekly_data.reindex(r, fill_value=0).reset_index()
print (weekly_data)
date sum
0 2020-08-30 0
1 2020-09-06 0
2 2020-09-13 0
3 2020-09-20 0
4 2020-09-27 0
5 2020-10-04 0
6 2020-10-11 78
7 2020-10-18 673
8 2020-10-25 0

Build rows in Python Dataframe, based on values in previous column

My input looks like this:
import datetime as dt
import pandas as pd
some_money = [34,42,300,450,550]
df = pd.DataFrame({'TIME': ['2020-01', '2019-12', '2019-11', '2019-10', '2019-09'], \
'MONEY':some_money})
df
Producing the following:
I want to add 3 more columns, getting the MONEY value for the previous month, like this (color coding for illustrative purposes):
This is what I have tried:
prev_period_money = ["m-1", "m-2", "m-3"]
for m in prev_period_money:
df[m] = df["MONEY"] - 10 #well, it "works", but it gives df["MONEY"]- 10...
The TIME column is sorted, so one should not care about it. (But it would be great, if someone shows the "magic", being able to get data from it.)
Use for pandas 0.24+ fill_value=0 in Series.shift, then also are correct integers columns:
for x in range(1,4):
df[f"m-{x}"] = df["MONEY"].shift(periods=-x, fill_value=0)
print (df)
TIME MONEY m-1 m-2 m-3
0 2020-01 34 42 300 450
1 2019-12 42 300 450 550
2 2019-11 300 450 550 0
3 2019-10 450 550 0 0
4 2019-09 550 0 0 0
For pandas below 0.24 is necessary replace mising values and convert to integers:
for x in range(1,4):
df[f"m-{x}"] = df["MONEY"].shift(periods=-x).fillna(0).astype(int)
It is quite easy if you use shift
That would give you the desired output:
df["m-1"] = df["MONEY"].shift(periods=-1)
df["m-2"] = df["MONEY"].shift(periods=-2)
df["m-3"] = df["MONEY"].shift(periods=-3)
df = df.fillna(0)
This would work only if it's ordered. Otherwise you have to order it before.
My suggestion: Use a list comprehension with the shift function to get your three columns, concat them on columns, and concatenate it again to the original dataframe
(pd.concat([df,pd.concat([df.MONEY.shift(-i) for i in
range(1,4)],axis=1)],
axis=1)
.fillna(0)
)
TIME MONEY MONEY MONEY MONEY
0 2020-01 34 42.0 300.0 450.0
1 2019-12 42 300.0 450.0 550.0
2 2019-11 300 450.0 550.0 0.0
3 2019-10 450 550.0 0.0 0.0
4 2019-09 550 0.0 0.0 0.0
import pandas as pd
columns = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov"]
some_money = [34,42,300,450,550]
df = pd.DataFrame({'TIME': ['2020-01', '2019-12', '2019-11', '2019-10', '2019-09'], 'MONEY':some_money})
prev_period_money = ["m-1", "m-2", "m-3"]
count = 1
for m in prev_period_money:
df[m] = df['MONEY'].iloc[count:].reset_index(drop=True)
count += 1
df = df.fillna(0)
Output:
TIME MONEY m-1 m-2 m-3
0 2020-01 34 42.0 300.0 450.0
1 2019-12 42 300.0 450.0 550.0
2 2019-11 300 450.0 550.0 0.0
3 2019-10 450 550.0 0.0 0.0
4 2019-09 550 0.0 0.0 0.0

Pandas Dataframe: shift/merge multiple rows sharing the same column values into one row

Sorry for any possible confusion with the title. I will describe my question better with the following code and pictures.
Now I have a dataframe with multiple columns. The first two columns, by which they are sorted, 'Route' and 'ID' (Sorry about the formatting, all the rows here have 'Route' value of '100' and 'ID' from 1 to 3.
df1.head(9)
Route ID Year Vol Truck_Vol Truck_%
0 100 1 2017.0 7016 635.0 9.1
1 100 1 2014.0 6835 NaN NaN
2 100 1 2011.0 5959 352.0 5.9
3 100 2 2018.0 15828 NaN NaN
4 100 2 2015.0 13114 2964.0 22.6
5 100 2 2009.0 11844 1280.0 10.8
6 100 3 2016.0 15434 NaN NaN
7 100 3 2013.0 18699 2015.0 10.8
8 100 3 2010.0 15903 NaN NaN
What I want to have is
Route ID Year Vol1 Truck_Vol1 Truck_%1 Year2 Vol2 Truck_Vol2 Truck_%2 Year3 Vol3 Truck_Vol3 Truck_%3
0 100 1 2017 7016 635.0 9.1 2014 6835 NaN NaN 2011 5959 352.0 5.9
1 100 2 2018 15828 NaN NaN 2015 13114 2964.0 22.6 2009 11844 1280.0 10.8
2 100 3 2016 15434 NaN NaN 2013 18699 2015.0 10.8 2010 15903 NaN NaN
Again, sorry for the messy formatting. Let me try a simplified version.
Input:
Route ID Year Vol T_%
0 100 1 2017 100 1.0
1 100 1 2014 200 NaN
2 100 1 2011 300 2.0
3 100 2 2018 400 NaN
4 100 2 2015 500 3.0
5 100 2 2009 600 4.0
Desired Output:
Route ID Year Vol T_% Year.1 Vol.1 T_%.1 Year.2 Vol.2 T_%.2
0 100 1 2017 100 1.0 2014 200 NaN 2011 300 2
1 100 2 2018 400 NaN 2015 500 3.0 2009 600 4
So basically just move the cells shown in the picture
I am stumped here. The names for the newly generated columns don't matter.
For this current dataframe, I have three rows per 'group' like shown in the code. It will be great if the answer can accommodate any number of rows each group.
Thanks for your time.
with groupby + cumcount + set_index + unstack
df1 = df.assign(cid = df.groupby(['Route', 'ID']).cumcount()).set_index(['Route', 'ID', 'cid']).unstack(-1).sort_index(1,1)
df1.columns = [f'{x}{y}' for x,y in df1.columns]
df1 = df1.reset_index()
Output df1:
Route ID T_%0 Vol0 Year0 T_%1 Vol1 Year1 T_%2 Vol2 Year2
0 100 1 1.0 100 2017 NaN 200 2014 2.0 300 2011
1 100 2 NaN 400 2018 3.0 500 2015 4.0 600 2009
melt + pivot_table
v = df.melt(id_vars=['Route', 'ID'])
v['variable'] += v.groupby(['Route', 'ID', 'variable']).cumcount().astype(str)
res = v.pivot_table(index=['Route', 'ID'], columns='variable', values='value')
variable T_% 0 T_% 1 T_% 2 Vol 0 Vol 1 Vol 2 Year 0 Year 1 Year 2
Route ID
100 1 1.0 NaN 2.0 100.0 200.0 300.0 2017.0 2014.0 2011.0
2 NaN 3.0 4.0 400.0 500.0 600.0 2018.0 2015.0 2009.0
If you want to sort these:
c = res.columns.str.extract(r'(\d+)')[0].values.astype(int)
res.iloc[:,np.argsort(c)]
variable T_%0 Vol0 Year0 T_%1 Vol1 Year1 T_%2 Vol2 Year2
Route ID
100 1 1.0 100.0 2017.0 NaN 200.0 2014.0 2.0 300.0 2011.0
2 NaN 400.0 2018.0 3.0 500.0 2015.0 4.0 600.0 2009.0
You asked about why I used cumcount. To explain, here is what v looks like from above:
Route ID variable value
0 100 1 Year 2017.0
1 100 1 Year 2014.0
2 100 1 Year 2011.0
3 100 2 Year 2018.0
4 100 2 Year 2015.0
5 100 2 Year 2009.0
6 100 1 Vol 100.0
7 100 1 Vol 200.0
8 100 1 Vol 300.0
9 100 2 Vol 400.0
10 100 2 Vol 500.0
11 100 2 Vol 600.0
12 100 1 T_% 1.0
13 100 1 T_% NaN
14 100 1 T_% 2.0
15 100 2 T_% NaN
16 100 2 T_% 3.0
17 100 2 T_% 4.0
If I used pivot_table on this DataFrame, you would end up with something like this:
variable T_% Vol Year
Route ID
100 1 1.5 200.0 2014.0
2 3.5 500.0 2014.0
Obviously you are losing data here. cumcount is the solution, as it turns the variable series into this:
Route ID variable value
0 100 1 Year0 2017.0
1 100 1 Year1 2014.0
2 100 1 Year2 2011.0
3 100 2 Year0 2018.0
4 100 2 Year1 2015.0
5 100 2 Year2 2009.0
6 100 1 Vol0 100.0
7 100 1 Vol1 200.0
8 100 1 Vol2 300.0
9 100 2 Vol0 400.0
10 100 2 Vol1 500.0
11 100 2 Vol2 600.0
12 100 1 T_%0 1.0
13 100 1 T_%1 NaN
14 100 1 T_%2 2.0
15 100 2 T_%0 NaN
16 100 2 T_%1 3.0
17 100 2 T_%2 4.0
Where you have a count of repeated elements per unique Route and ID.

Pandas dataframe Groupby and retrieve date range

Here is my dataframe that I am working on. There are two pay periods defined:
first 15 days and last 15 days for each month.
date employee_id hours_worked id job_group report_id
0 2016-11-14 2 7.50 385 B 43
1 2016-11-15 2 4.00 386 B 43
2 2016-11-30 2 4.00 387 B 43
3 2016-11-01 3 11.50 388 A 43
4 2016-11-15 3 6.00 389 A 43
5 2016-11-16 3 3.00 390 A 43
6 2016-11-30 3 6.00 391 A 43
I need to group by employee_id and job_group but at the same time
I have to achieve date range for that grouped row.
For example grouped results would be like following for employee_id 1:
Expected Output:
date employee_id hours_worked job_group report_id
1 2016-11-15 2 11.50 B 43
2 2016-11-30 2 4.00 B 43
4 2016-11-15 3 17.50 A 43
5 2016-11-16 3 9.00 A 43
Is this possible using pandas dataframe groupby?
Use SM with Grouper and last add SemiMonthEnd:
df['date'] = pd.to_datetime(df['date'])
d = {'hours_worked':'sum','report_id':'first'}
df = (df.groupby(['employee_id','job_group',pd.Grouper(freq='SM',key='date', closed='right')])
.agg(d)
.reset_index())
df['date'] = df['date'] + pd.offsets.SemiMonthEnd(1)
print (df)
employee_id job_group date hours_worked report_id
0 2 B 2016-11-15 11.5 43
1 2 B 2016-11-30 4.0 43
2 3 A 2016-11-15 17.5 43
3 3 A 2016-11-30 9.0 43
a. First, (for each employee_id) use multiple Grouper with the .sum() on the hours_worked column. Second, use DateOffset to achieve bi-weekly date column. After these 2 steps, I have assigned the date in the grouped DF based on 2 brackets (date ranges) - if day of month (from the date column) is <=15, then I set the day in date to 15, else I set the day to 30. This day is then used to assemble a new date. I calculated month end day based on 1, 2.
b. (For each employee_id) get the .last() record for the job_group and report_id columns
c. merge a. and b. on the employee_id key
# a.
hours = (df.groupby([
pd.Grouper(key='employee_id'),
pd.Grouper(key='date', freq='SM')
])['hours_worked']
.sum()
.reset_index())
hours['date'] = pd.to_datetime(hours['date'])
hours['date'] = hours['date'] + pd.DateOffset(days=14)
# Assign day based on bracket (date range) 0-15 or bracket (date range) >15
from pandas.tseries.offsets import MonthEnd
hours['bracket'] = hours['date'] + MonthEnd(0)
hours['bracket'] = pd.to_datetime(hours['bracket']).dt.day
hours.loc[hours['date'].dt.day <= 15, 'bracket'] = 15
hours['date'] = pd.to_datetime(dict(year=hours['date'].dt.year,
month=hours['date'].dt.month,
day=hours['bracket']))
hours.drop('bracket', axis=1, inplace=True)
# b.
others = (df.groupby('employee_id')['job_group','report_id']
.last()
.reset_index())
# c.
merged = hours.merge(others, how='inner', on='employee_id')
Raw data for employee_id==1 and employeeid==3
df.sort_values(by=['employee_id','date'], inplace=True)
print(df[df.employee_id.isin([1,3])])
index date employee_id hours_worked id job_group report_id
0 0 2016-11-14 1 7.5 481 A 43
10 10 2016-11-21 1 6.0 491 A 43
11 11 2016-11-22 1 5.0 492 A 43
15 15 2016-12-14 1 7.5 496 A 43
25 25 2016-12-21 1 6.0 506 A 43
26 26 2016-12-22 1 5.0 507 A 43
6 6 2016-11-02 3 6.0 487 A 43
4 4 2016-11-08 3 6.0 485 A 43
3 3 2016-11-09 3 11.5 484 A 43
5 5 2016-11-11 3 3.0 486 A 43
20 20 2016-11-12 3 3.0 501 A 43
21 21 2016-12-02 3 6.0 502 A 43
19 19 2016-12-08 3 6.0 500 A 43
18 18 2016-12-09 3 11.5 499 A 43
Output
print(merged)
employee_id date hours_worked job_group report_id
0 1 2016-11-15 7.5 A 43
1 1 2016-11-30 11.0 A 43
2 1 2016-12-15 7.5 A 43
3 1 2016-12-31 11.0 A 43
4 2 2016-11-15 31.0 B 43
5 2 2016-12-15 31.0 B 43
6 3 2016-11-15 29.5 A 43
7 3 2016-12-15 23.5 A 43
8 4 2015-03-15 5.0 B 43
9 4 2016-02-29 5.0 B 43
10 4 2016-11-15 5.0 B 43
11 4 2016-11-30 15.0 B 43
12 4 2016-12-15 5.0 B 43
13 4 2016-12-31 15.0 B 43

Categories