DataFrame groupby and divide by group sum - python

In order to build stock portfolios for a backtest I am trying to get the market capitalization (me) weight of each stock within its portfolio. For test purposes I built the following DataFrame of price and share observations. Every day I assign the stocks to quantiles based on price, and all stocks in the same quantile that day form one portfolio:
import pandas as pd

d = {'date': ['20220101', '20220101', '20220101', '20220101', '20220102', '20220102', '20220102', '20220102'],
     'price': [1, 1.2, 1.3, 1.5, 1.7, 2, 1.5, 1],
     'shrs': [100, 100, 100, 100, 100, 100, 100, 100]}
df = pd.DataFrame(data=d)
df.set_index('date', inplace=True)
df.index = pd.to_datetime(df.index, format='%Y%m%d')
df["me"] = df['price'] * df['shrs']
df['rank'] = df.groupby('date')['price'].transform(lambda x: pd.qcut(x, 2, labels=range(1,3), duplicates='drop'))
df
price shrs me rank
date
2022-01-01 1.0 100 100.0 1
2022-01-01 1.2 100 120.0 1
2022-01-01 1.3 100 130.0 2
2022-01-01 1.5 100 150.0 2
2022-01-02 1.7 100 170.0 2
2022-01-02 2.0 100 200.0 2
2022-01-02 1.5 100 150.0 1
2022-01-02 1.0 100 100.0 1
In the next step I group by 'date' and 'rank' and divide each observation's market cap by the sum of the group's market cap in order to obtain the stock's weight within its portfolio:
df['weight'] = df.groupby(['date', 'rank'], group_keys=False).apply(lambda x: x['me'] / x['me'].sum()).sort_index()
print(df)
price shrs me rank weight
date
2022-01-01 1.0 100 100.0 1 0.454545
2022-01-01 1.2 100 120.0 1 0.545455
2022-01-01 1.3 100 130.0 2 0.464286
2022-01-01 1.5 100 150.0 2 0.535714
2022-01-02 1.7 100 170.0 2 0.600000
2022-01-02 2.0 100 200.0 2 0.400000
2022-01-02 1.5 100 150.0 1 0.459459
2022-01-02 1.0 100 100.0 1 0.540541
Now comes the problem. On my test df this works perfectly fine. However, on the real data (a DataFrame of shape 160000 x 21) the calculation takes forever and I always have to interrupt the Jupyter kernel at some point. Is there a more efficient way to do this? What am I missing?
Interestingly, colleagues run the same code on similar DataFrames and for them it takes only seconds.

Use GroupBy.transform with 'sum' to get a Series aligned with the original rows, then divide the me column by it:
df['weight'] = df['me'].div(df.groupby(['date', 'rank'])['me'].transform('sum'))
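For a rough sense of why this helps, here is a timing sketch (my addition, on made-up synthetic data of roughly the real size, not part of the original answer): transform('sum') stays in vectorized groupby code, while the lambda-based apply calls back into Python once per group.
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 160_000  # roughly the row count of the real data
big = pd.DataFrame({
    'date': pd.to_datetime('2022-01-01') + pd.to_timedelta(rng.integers(0, 500, n), unit='D'),
    'rank': rng.integers(1, 3, n),
    'me': rng.uniform(50, 500, n),
})

t0 = time.perf_counter()
w_fast = big['me'].div(big.groupby(['date', 'rank'])['me'].transform('sum'))
print('transform:', time.perf_counter() - t0)

t0 = time.perf_counter()
w_slow = (big.groupby(['date', 'rank'], group_keys=False)
             .apply(lambda x: x['me'] / x['me'].sum()))
print('apply:', time.perf_counter() - t0)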

It might not be the most elegant solution, but if you run into performance issues you can split the work into several steps: store the grouped sum of me in a separate object and then merge it back:
temp = df.groupby(['date', 'rank'], group_keys=False).apply(lambda x: x['me'].sum())
temp = temp.reset_index(name='weight')
df = df.merge(temp, on=['date', 'rank'])
df['weight'] = df['me'] / df['weight']
df.set_index('date', inplace=True)
df
which should lead to the output:
price shrs me rank weight
date
2022-01-01 1.0 100 100.0 1 0.454545
2022-01-01 1.2 100 120.0 1 0.545455
2022-01-01 1.3 100 130.0 2 0.464286
2022-01-01 1.5 100 150.0 2 0.535714
2022-01-02 1.7 100 170.0 2 0.459459
2022-01-02 2.0 100 200.0 2 0.540541
2022-01-02 1.5 100 150.0 1 0.600000
2022-01-02 1.0 100 100.0 1 0.400000

Related

Stick the dataframe rows and column in one row + replace the NaN values with the day before or after

I have a df and I want to stick its values together into one row. First I want to select a specific time period and replace the NaN values with the value from the day before. Here is a simple example: I only want to choose the values in 2020, stick them together based on the time, and also replace each NaN value with the one from the day before.
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['day'] =[ '2020-01-01', '2019-01-01', '2020-01-02','2020-01-03', '2018-01-01', '2020-01-15','2020-03-01', '2020-02-01', '2017-01-01' ]
df['value_1'] = [ 1, np.nan, 32, 48, 5, -1, 5,10,2]
df['value_2'] = [ np.nan, 121, 23, 34, 15, 21, 15, 12, 39]
df
day value_1 value_2
0 2020-01-01 1.0 NaN
1 2019-01-01 NaN 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
4 2018-01-01 5.0 15.0
5 2020-01-15 -1.0 21.0
6 2020-03-01 5.0 15.0
7 2020-02-01 10.0 12.0
8 2017-01-01 2.0 39.0
The output:
_1 _2 _3 _4 _5 _6 _7 _8 _9 _10 _11 _12
0 1 121 1 23 48 34 -1 21 10 12 -1 21
I have tried to use the following code, but it does not solve my problem:
val_cols = df.filter(like='value_').columns
output = (df.pivot('day', val_cols).groupby(level=0, axis=1).apply(lambda x:x.ffill(axis=1).bfill(axis=1)).sort_index(axis=1, level=1))
I don't know exactly what the output is supposed to be, but I think this should do at least part of what you're trying to do:
df['day'] = pd.to_datetime(df['day'], format='%Y-%m-%d')
df = df.sort_values(by=['day'])
filter_2020 = df['day'].dt.year == 2020
val_cols = df.filter(like='value_').columns
df.loc[filter_2020, val_cols] = df.loc[:,val_cols].ffill().loc[filter_2020]
print(df)
day value_1 value_2
8 2017-01-01 2.0 39.0
4 2018-01-01 5.0 15.0
1 2019-01-01 NaN 121.0
0 2020-01-01 1.0 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
5 2020-01-15 -1.0 21.0
7 2020-02-01 10.0 12.0
6 2020-03-01 5.0 15.0
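If "stick ... in one row" from the title means flattening the filtered 2020 rows into a single wide row, a possible continuation of the snippet above (my guess at the intent, reusing df, filter_2020 and val_cols from it) is:
flat = df.loc[filter_2020, val_cols].to_numpy().ravel()  # row-wise: value_1, value_2, value_1, ...
one_row = pd.DataFrame([flat], columns=[f'_{i+1}' for i in range(len(flat))])
print(one_row)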

How to create a new column with the last value of the previous year

I have this data frame
import pandas as pd
df = pd.DataFrame({'COTA': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Date': ['14/10/2021', '19/10/2020', '29/10/2019', '30/09/2021', '20/09/2020',
                            '20/10/2021', '29/10/2020', '15/10/2019', '10/09/2020'],
                   'Mark': [1, 2, 3, 4, 5, 1, 2, 3, 3]})
print(df)
Based on this data frame I want the Mark from the previous year. I managed to get the maximum per COTA, but I want the last one: I used .max() and thought I could get it with .last(), but it didn't work.
Here is the example of my code:
df['Date'] = pd.to_datetime(df['Date'])
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
s1 = df.groupby(['COTA', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['COTA', 'LastYear'])
print (df)
COTA Date Mark LastYear Max_MarkLastYear
0 A 2021-10-14 1 2021-12-31 5.0
1 A 2020-10-19 2 2020-12-31 3.0
2 A 2019-10-29 3 2019-12-31 NaN
3 A 2021-09-30 4 2021-12-31 5.0
4 A 2020-09-20 5 2020-12-31 3.0
5 B 2021-10-20 1 2021-12-31 3.0
6 B 2020-10-29 2 2020-12-31 3.0
7 B 2019-10-15 3 2019-12-31 NaN
8 B 2020-10-09 3 2020-12-31 3.0
How do I create a new column with the last value of the previous year?
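A possible sketch for the 'last' variant (my addition, not from the post: it reuses the join idea from the code above, uses the sample's COTA column, and assumes day-first dates): sort by Date so that .last() picks the chronologically last Mark within each year, then shift the year and join back.
import pandas as pd

df = pd.DataFrame({'COTA': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Date': ['14/10/2021', '19/10/2020', '29/10/2019', '30/09/2021', '20/09/2020',
                            '20/10/2021', '29/10/2020', '15/10/2019', '10/09/2020'],
                   'Mark': [1, 2, 3, 4, 5, 1, 2, 3, 3]})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # assumption: dates are day-first
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)

s1 = df.sort_values('Date').groupby(['COTA', 'LastYear'])['Mark'].last()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Last_MarkLastYear'), on=['COTA', 'LastYear'])
print(df)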

pandas calculates column value means on groups and means across whole dataframe

I have a df where df['period'] = (df['date1'] - df['date2']) / np.timedelta64(1, 'D'):
code y_m date1 date2 period
1000 201701 2017-12-10 2017-12-09 1
1000 201701 2017-12-14 2017-12-12 2
1000 201702 2017-12-15 2017-12-13 2
1000 201702 2017-12-17 2017-12-15 2
2000 201701 2017-12-19 2017-12-18 1
2000 201701 2017-12-12 2017-12-10 2
2000 201702 2017-12-11 2017-12-10 1
2000 201702 2017-12-13 2017-12-12 1
2000 201702 2017-12-11 2017-12-10 1
Then I group by code and y_m to calculate the average of date1 - date2:
df_avg_period = df.groupby(['code', 'y_m'])['period'].mean().reset_index(name='avg_period')
code y_m avg_period
1000 201701 1.5
1000 201702 2
2000 201701 1.5
2000 201702 1
But I would like to convert df_avg_period into a matrix that puts code on the rows and y_m on the columns, like:
0 1 2 3
0 -1 0 201701 201702
1 0 1.44 1.44 1.4
2 1000 1.75 1.5 2
3 2000 1.20 1.5 1
-1 represents a dummy value that indicates either that a value doesn't exist for a specific code/y_m cell or simply pads the matrix shape; 0 represents 'all' values, i.e. the average over code, over y_m, or over both. For example, cell (1,1) averages the period values over all rows in df, and (1,2) averages period over all rows that have 201701 as y_m.
Apparently pivot_table doesn't give me the correct results using mean, so I am wondering how to achieve this correctly?
pivot_table with margins=True
piv = df.pivot_table(
index='code', columns='y_m', values='period', aggfunc='mean', margins=True
)
# housekeeping
(piv.reset_index()
.rename_axis(None, axis=1)
.rename({'code' : -1, 'All' : 0}, axis=1)
.sort_index(axis=1)
)
-1 0 201701 201702
0 1000 1.750000 1.5 2.0
1 2000 1.200000 1.5 1.0
2 All 1.444444 1.5 1.4
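If you also want the 'All' margin shown as 0 and the index header as -1, as in the desired matrix, a small relabelling sketch (my addition, reusing piv from above, not part of the answer) could be:
out = (piv.rename(index={'All': 0})
          .reset_index()
          .rename_axis(None, axis=1)
          .rename(columns={'code': -1, 'All': 0})
          .sort_index(axis=1))
print(out)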

finding intersection of intervals in pandas

I have two dataframes
df_a=
Start Stop Value
0 0 100 0.0
1 101 200 1.0
2 201 1000 0.0
df_b=
Start Stop Value
0 0 50 0.0
1 51 300 1.0
2 301 1000 0.0
I would like to generate a DataFrame which contains the intervals, as identified by Start and Stop, where Value was the same in df_a and df_b. For each interval I would like to store whether Value was the same, and what the value was in df_a and df_b.
Desired output:
df_out=
Start Stop SameValue Value_dfA Value_dfB
0 50 1 0 0
51 100 0 0 1
101 200 1 1 1
201 300 0 0 1
[...]
Not sure if this is the best way to do this but you can reindex, join, groupby and agg to get your intervals, e.g.:
Expand each df so that the index is every single value of the range (Start to Stop) using reindex() and padding the values:
In []:
df_a_expanded = df_a.set_index('Start').reindex(range(max(df_a['Stop'])+1)).fillna(method='pad')
df_a_expanded
Out[]:
Stop Value
Start
0 100.0 0.0
1 100.0 0.0
2 100.0 0.0
3 100.0 0.0
4 100.0 0.0
...
997 1000.0 0.0
998 1000.0 0.0
999 1000.0 0.0
1000 1000.0 0.0
[1001 rows x 2 columns]
In []:
df_b_expanded = df_b.set_index('Start').reindex(range(max(df_b['Stop'])+1)).fillna(method='pad')
Join the two expanded dfs:
In []:
df = df_a_expanded.join(df_b_expanded, lsuffix='_dfA', rsuffix='_dfB').reset_index()
df
Out[]:
Start Stop_dfA Value_dfA Stop_dfB Value_dfB
0 0 100.0 0.0 50.0 0.0
1 1 100.0 0.0 50.0 0.0
2 2 100.0 0.0 50.0 0.0
3 3 100.0 0.0 50.0 0.0
4 4 100.0 0.0 50.0 0.0
...
Note: you can ignore the Stop columns and could have dropped them in the previous step.
There is no standard way to groupby only consecutive values (à la itertools.groupby), so resorting to a cumsum() hack:
In []:
groups = (df[['Value_dfA', 'Value_dfB']] != df[['Value_dfA', 'Value_dfB']].shift()).any(axis=1).cumsum()
g = df.groupby([groups, 'Value_dfA', 'Value_dfB'], as_index=False)
Now you can get the result you want by aggregating the group with min, max:
In []:
df_out = g['Start'].agg({'Start': 'min', 'Stop': 'max'})
df_out
Out[]:
Value_dfA Value_dfB Start Stop
0 0.0 0.0 0 50
1 0.0 1.0 51 100
2 1.0 1.0 101 200
3 0.0 1.0 201 300
4 0.0 0.0 301 1000
Now you just have to add the SameValue column and, if desired, order the columns to get the exact output you want:
In []:
df_out['SameValue'] = (df_out['Value_dfA'] == df_out['Value_dfB'])*1
df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]
Out[]:
Start Stop SameValue Value_dfA Value_dfB
0 0 50 1 0.0 0.0
1 51 100 0 0.0 1.0
2 101 200 1 1.0 1.0
3 201 300 0 0.0 1.0
4 301 1000 1 0.0 0.0
This assumes the ranges of the two dataframes are the same, or you will need to handle the NaNs you will get with the join().
I found a way but not sure it is the most efficient. You have the input data:
import pandas as pd
dfa = pd.DataFrame({'Start': [0, 101, 201], 'Stop': [100, 200, 1000], 'Value': [0., 1., 0.]})
dfb = pd.DataFrame({'Start': [0, 51, 301], 'Stop': [50, 300, 1000], 'Value': [0., 1., 0.]})
First I would create the columns Start and Stop of df_out with:
df_out = pd.DataFrame({'Start': sorted(set(dfa['Start'])|set(dfb['Start'])),
'Stop': sorted(set(dfa['Stop'])|set(dfb['Stop']))})
Then, to get the value of dfa (and dfb) associated with the right (Start, Stop) range in a column named Value_dfA (and Value_dfB), I would do:
df_out['Value_dfA'] = df_out['Start'].apply(lambda x: dfa['Value'][dfa['Start'] <= x].iloc[-1])
df_out['Value_dfB'] = df_out['Start'].apply(lambda x: dfb['Value'][dfb['Start'] <= x].iloc[-1])
To get the column SameValue, do:
df_out['SameValue'] = df_out.apply(lambda x: 1 if x['Value_dfA'] == x['Value_dfB'] else 0,axis=1)
If it matters, you can reorder the columns with:
df_out = df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]
Your output is then
Start Stop SameValue Value_dfA Value_dfB
0 0 50 1 0.0 0.0
1 51 100 0 0.0 1.0
2 101 200 1 1.0 1.0
3 201 300 0 0.0 1.0
4 301 1000 1 0.0 0.0
I have an O(n log n) solution, where n is the total number of rows in df_a and df_b. Here's how it goes:
Rename the Value columns of the two dataframes to value_a and value_b respectively (all column names are lowercased below), then append df_b to df_a:
df = df_a.append(df_b)
Sort the df with respect to start column.
df = df.sort_values('start')
Resulting dataframe will look like this:
start stop value_a value_b
0 0 100 0.0 NaN
0 0 50 NaN 0.0
1 51 300 NaN 1.0
1 101 200 1.0 NaN
2 201 1000 0.0 NaN
2 301 1000 NaN 0.0
Forward fill the missing values:
df = df.fillna(method='ffill')
Compute same_value column:
df['same_value'] = df['value_a'] == df['value_b']
Recompute stop column:
df.stop = df.start.shift(-1)
You will get the dataframe you desire, except for the first and last rows, which are easy to fix (a cleanup sketch follows the output below):
start stop value_a value_b same_value
0 0 0.0 0.0 NaN False
0 0 51.0 0.0 0.0 True
1 51 101.0 0.0 1.0 False
1 101 201.0 1.0 1.0 True
2 201 301.0 0.0 1.0 False
2 301 NaN 0.0 0.0 True
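For completeness, here is a self-contained cleanup sketch (my addition, not part of this answer) that drops the spurious first row, makes the stops inclusive, and closes the last interval; it uses pd.concat because DataFrame.append was removed in pandas 2.0:
import pandas as pd

df_a = pd.DataFrame({'start': [0, 101, 201], 'stop': [100, 200, 1000], 'value_a': [0., 1., 0.]})
df_b = pd.DataFrame({'start': [0, 51, 301], 'stop': [50, 300, 1000], 'value_b': [0., 1., 0.]})

df = pd.concat([df_a, df_b]).sort_values('start')
df[['value_a', 'value_b']] = df[['value_a', 'value_b']].ffill()
df['same_value'] = df['value_a'] == df['value_b']

max_stop = df['stop'].max()
df['stop'] = df['start'].shift(-1) - 1   # inclusive stop = next start - 1
df = df.iloc[1:].copy()                  # drop the spurious first interval
df['stop'] = df['stop'].fillna(max_stop).astype(int)
print(df[['start', 'stop', 'value_a', 'value_b', 'same_value']])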
Here is an answer which computes the overlapping intervals really quickly (which answers the question in the title):
from io import StringIO
import pandas as pd
from ncls import NCLS
c1 = StringIO("""Start Stop Value
0 100 0.0
101 200 1.0
201 1000 0.0""")
c2 = StringIO("""Start Stop Value
0 50 0.0
51 300 1.0
301 1000 0.0""")
df1 = pd.read_table(c1, sep=r"\s+")
df2 = pd.read_table(c2, sep=r"\s+")
ncls = NCLS(df1.Start.values, df1.Stop.values, df1.index.values)
x1, x2 = ncls.all_overlaps_both(df2.Start.values, df2.Stop.values, df2.index.values)
df1 = df1.reindex(x2).reset_index(drop=True)
df2 = df2.reindex(x1).reset_index(drop=True)
# print(df1)
# print(df2)
df = df1.join(df2, rsuffix="2")
print(df)
# Start Stop Value Start2 Stop2 Value2
# 0 0 100 0.0 0 50 0.0
# 1 0 100 0.0 51 300 1.0
# 2 101 200 1.0 51 300 1.0
# 3 201 1000 0.0 51 300 1.0
# 4 201 1000 0.0 301 1000 0.0
With this final df it should be simple to get to the result you need; it is left as an exercise for the reader, but one possible way is sketched below.
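One possible way to finish (a sketch continuing from the joined df above, not the original answer's code): intersect each overlapping pair and compare the values.
df_out = pd.DataFrame({
    'Start': df[['Start', 'Start2']].max(axis=1),
    'Stop': df[['Stop', 'Stop2']].min(axis=1),
    'SameValue': (df['Value'] == df['Value2']).astype(int),
    'Value_dfA': df['Value'],
    'Value_dfB': df['Value2'],
})
print(df_out)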
See NCLS for the interval overlap data structure.

Obtaining the class with maximum frequency (python)

So, based on the following groupby code:
aps1.groupby(['S3bin2','S105_9bin2', 'class_predict']).size().unstack()
I get the following output:
class_predict 0 1
S3bin2 S105_9bin2
50 50 16058.0 133.0
100 256.0 7.0
150 161.0 NaN
200 160.0 1.0
400000 4195.0 58.0
100 50 3480.0 20.0
100 68.0 NaN
150 43.0 1.0
200 48.0 1.0
400000 689.0 2.0
150 50 1617.0 6.0
100 73.0 NaN
150 33.0 NaN
200 52.0 NaN
400000 935.0 3.0
200 50 1155.0 8.0
100 73.0 1.0
150 37.0 NaN
200 45.0 NaN
400000 937.0 NaN
300000 50 11508.0 178.0
100 748.0 11.0
150 446.0 5.0
200 350.0 9.0
400000 13080.0 49.0
So for the group 50 in both S3bin2 and S105_9bin2, the frequency of 0 is the highest. Is it possible to run a function whereby I can print the groups for which 0 has the highest count, and also the count itself? I've tried transform(max) and other things, but I'm not getting it.
Solution for testing the maximum over all the data:
First remove the unstack, aggregate with idxmax and max, and finally build the output string with format:
s = aps1.groupby(['S3bin2','S105_9bin2', 'class_predict']).size()
a = s.agg(['idxmax', 'max'])
print (a)
idxmax (50, 50, 0)
max 16058
dtype: object
print (s.index.names)
['S3bin2', 'S105_9bin2', None]
a,b,c = a['max'], a['idxmax'], s.index.names
d = 'Maximum failure ({0}) at {1[0]}({2[0]}) and {1[1]}({2[1]})'.format(a,b,c)
print (d)
Maximum failure (16058) at 50(S3bin2) and 50(S105_9bin2)
But if you want to test only column 0 or column 1:
df = aps1.groupby(['S3bin2','S105_9bin2', 'class_predict']).size().unstack()
#change 0 to 1 for test column 1
a = df[0].agg(['idxmax', 'max'])
print (a)
idxmax (50, 50)
max 16058
Name: 0, dtype: object
a,b,c = a['max'], a['idxmax'], df.index.names
d = 'Maximum failure ({0}) at {1[0]}({2[0]}) and {1[1]}({2[1]})'.format(a,b,c)
print (d)
Maximum failure (16058.0) at 50(S3bin2) and 50(S105_9bin2)
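If instead you want, for every (S3bin2, S105_9bin2) group, which class is most frequent and its count (another possible reading of the question; this sketch is my addition and reuses the question's aps1 frame), take idxmax across the unstacked columns:
df = aps1.groupby(['S3bin2', 'S105_9bin2', 'class_predict']).size().unstack(fill_value=0)
winner = df.idxmax(axis=1)                 # most frequent class per group
count = df.max(axis=1)                     # and its count
print(winner[winner == 0])                 # groups where class 0 has the highest count
print(count[winner == 0])                  # the corresponding counts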
