Build rows in Python Dataframe, based on values in previous column

My input looks like this:
import datetime as dt
import pandas as pd
some_money = [34,42,300,450,550]
df = pd.DataFrame({'TIME': ['2020-01', '2019-12', '2019-11', '2019-10', '2019-09'],
                   'MONEY': some_money})
df
Producing the following:
      TIME  MONEY
0  2020-01     34
1  2019-12     42
2  2019-11    300
3  2019-10    450
4  2019-09    550
I want to add 3 more columns holding the MONEY values of the previous months, so the result looks like this (the color coding in the original screenshot was only for illustration):
      TIME  MONEY  m-1  m-2  m-3
0  2020-01     34   42  300  450
1  2019-12     42  300  450  550
2  2019-11    300  450  550    0
3  2019-10    450  550    0    0
4  2019-09    550    0    0    0
This is what I have tried:
prev_period_money = ["m-1", "m-2", "m-3"]
for m in prev_period_money:
    df[m] = df["MONEY"] - 10  # well, it "works", but it gives df["MONEY"] - 10...
The TIME column is already sorted, so the solution should not need to care about it. (But it would be great if someone could show the "magic" of actually using it.)

For pandas 0.24+ use fill_value=0 in Series.shift, so the new columns also stay as correct integer columns:
for x in range(1, 4):
    df[f"m-{x}"] = df["MONEY"].shift(periods=-x, fill_value=0)
print(df)
TIME MONEY m-1 m-2 m-3
0 2020-01 34 42 300 450
1 2019-12 42 300 450 550
2 2019-11 300 450 550 0
3 2019-10 450 550 0 0
4 2019-09 550 0 0 0
For pandas below 0.24 it is necessary to replace the missing values and convert back to integers:
for x in range(1, 4):
    df[f"m-{x}"] = df["MONEY"].shift(periods=-x).fillna(0).astype(int)

It is quite easy if you use shift. That would give you the desired output:
df["m-1"] = df["MONEY"].shift(periods=-1)
df["m-2"] = df["MONEY"].shift(periods=-2)
df["m-3"] = df["MONEY"].shift(periods=-3)
df = df.fillna(0)
This works only if the data is already ordered; otherwise you have to sort it first.
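If it is not already ordered, a minimal sketch (assuming TIME holds 'YYYY-MM' strings, which sort correctly as plain text) would be to sort on TIME first:
df = df.sort_values('TIME', ascending=False).reset_index(drop=True)  # newest month first
df["m-1"] = df["MONEY"].shift(periods=-1)
df["m-2"] = df["MONEY"].shift(periods=-2)
df["m-3"] = df["MONEY"].shift(periods=-3)
df = df.fillna(0)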

My suggestion: use a list comprehension with shift to build the three columns, concat them along the columns, and concat the result back onto the original dataframe:
(pd.concat([df, pd.concat([df.MONEY.shift(-i) for i in range(1, 4)], axis=1)],
           axis=1)
   .fillna(0)
)
TIME MONEY MONEY MONEY MONEY
0 2020-01 34 42.0 300.0 450.0
1 2019-12 42 300.0 450.0 550.0
2 2019-11 300 450.0 550.0 0.0
3 2019-10 450 550.0 0.0 0.0
4 2019-09 550 0.0 0.0 0.0

import pandas as pd

some_money = [34, 42, 300, 450, 550]
df = pd.DataFrame({'TIME': ['2020-01', '2019-12', '2019-11', '2019-10', '2019-09'], 'MONEY': some_money})
prev_period_money = ["m-1", "m-2", "m-3"]
count = 1
for m in prev_period_money:
    df[m] = df['MONEY'].iloc[count:].reset_index(drop=True)
    count += 1
df = df.fillna(0)
Output:
TIME MONEY m-1 m-2 m-3
0 2020-01 34 42.0 300.0 450.0
1 2019-12 42 300.0 450.0 550.0
2 2019-11 300 450.0 550.0 0.0
3 2019-10 450 550.0 0.0 0.0
4 2019-09 550 0.0 0.0 0.0

Related

DataFrame groupby and divide by group sum

In order to build stock portfolios for a backtest I am trying to get the market capitalization (me) weight of each stock within its portfolio. For test purposes I built the following DataFrame of price and return observations. Every day I am assigning the stocks to quantiles based on price and all stocks in the same quantile that day will be in one portfolio:
d = {'date': ['202211', '202211', '202211', '202211', '202212', '202212', '202212', '202212'],
     'price': [1, 1.2, 1.3, 1.5, 1.7, 2, 1.5, 1],
     'shrs': [100, 100, 100, 100, 100, 100, 100, 100]}
df = pd.DataFrame(data = d)
df.set_index('date', inplace=True)
df.index = pd.to_datetime(df.index, format='%Y%m%d')
df["me"] = df['price'] * df['shrs']
df['rank'] = df.groupby('date')['price'].transform(lambda x: pd.qcut(x, 2, labels=range(1,3), duplicates='drop'))
df
price shrs me rank
date
2022-01-01 1.0 100 100.0 1
2022-01-01 1.2 100 120.0 1
2022-01-01 1.3 100 130.0 2
2022-01-01 1.5 100 150.0 2
2022-01-02 1.7 100 170.0 2
2022-01-02 2.0 100 200.0 2
2022-01-02 1.5 100 150.0 1
2022-01-02 1.0 100 100.0 1
In the next step I am grouping by 'date' and 'rank' and divide each observation's market cap by the sum of the groups market cap in order to obtain the stocks weight in the portfolio:
df['weight'] = df.groupby(['date', 'rank'], group_keys=False).apply(lambda x: x['me'] / x['me'].sum()).sort_index()
print(df)
price shrs me rank weight
date
2022-01-01 1.0 100 100.0 1 0.454545
2022-01-01 1.2 100 120.0 1 0.545455
2022-01-01 1.3 100 130.0 2 0.464286
2022-01-01 1.5 100 150.0 2 0.535714
2022-01-02 1.7 100 170.0 2 0.600000
2022-01-02 2.0 100 200.0 2 0.400000
2022-01-02 1.5 100 150.0 1 0.459459
2022-01-02 1.0 100 100.0 1 0.540541
Now comes the flaw. On my test df this works perfectly fine. However on the real data (DataFrame with shape 160000 x 21) the calculations take endless and I always have to interrupt the Jupyter Kernel at some point. Is there a more efficient way to do this? What am I missing?
Interestingly I am using the same code as some colleagues on similar DataFrames and for them it takes seconds only.
Use GroupBy.transform with 'sum' to get a new Series of group totals and divide the me column by it:
df['weight'] = df['me'].div(df.groupby(['date', 'rank'])['me'].transform('sum'))
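transform('sum') returns a Series aligned to the original index, one group total per row, which is why the element-wise division works without any Python-level apply. On the toy frame above it looks roughly like this:
totals = df.groupby(['date', 'rank'])['me'].transform('sum')
print(totals.head(4))
# date
# 2022-01-01    220.0    <- 100 + 120 (rank 1 that day)
# 2022-01-01    220.0
# 2022-01-01    280.0    <- 130 + 150 (rank 2 that day)
# 2022-01-01    280.0
# Name: me, dtype: float64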
It might not be the most elegant solution, but if you run into performance issues you can split the work into multiple parts: store the grouped sums of me in a separate frame and then merge it back:
temp = df.groupby(['date', 'rank'], group_keys=False).apply(lambda x: x['me'].sum())
temp = temp.reset_index(name='weight')
df = df.merge(temp, on=['date', 'rank'])
df['weight'] = df['me'] / df['weight']
df.set_index('date', inplace=True)
df
which should lead to the output:
price shrs me rank weight
date
2022-01-01 1.0 100 100.0 1 0.454545
2022-01-01 1.2 100 120.0 1 0.545455
2022-01-01 1.3 100 130.0 2 0.464286
2022-01-01 1.5 100 150.0 2 0.535714
2022-01-02 1.7 100 170.0 2 0.459459
2022-01-02 2.0 100 200.0 2 0.540541
2022-01-02 1.5 100 150.0 1 0.600000
2022-01-02 1.0 100 100.0 1 0.400000

Pandas add missing weeks from range to dataframe

I am computing a DataFrame with weekly amounts and now I need to fill it with missing weeks from a provided date range.
This is how I'm generating the dataframe with the weekly amounts:
df['date'] = pd.to_datetime(df['date']) - timedelta(days=6)
weekly_data: pd.DataFrame = (df
    .groupby([pd.Grouper(key='date', freq='W-SUN')])[data_type]
    .sum()
    .reset_index()
)
Which outputs:
date sum
0 2020-10-11 78
1 2020-10-18 673
If a date range is given as start='2020-08-30' and end='2020-10-30', then I would expect the following dataframe:
date sum
0 2020-08-30 0.0
1 2020-09-06 0.0
2 2020-09-13 0.0
3 2020-09-20 0.0
4 2020-09-27 0.0
5 2020-10-04 0.0
6 2020-10-11 78
7 2020-10-18 673
8 2020-10-25 0.0
So far, I have managed to just add the missing weeks and set the sum to 0, but it also replaces the existing values:
weekly_data = weekly_data.reindex(pd.date_range('2020-08-30', '2020-10-30', freq='W-SUN')).fillna(0)
Which outputs:
date sum
0 2020-08-30 0.0
1 2020-09-06 0.0
2 2020-09-13 0.0
3 2020-09-20 0.0
4 2020-09-27 0.0
5 2020-10-04 0.0
6 2020-10-11 0.0 # should be 78
7 2020-10-18 0.0 # should be 673
8 2020-10-25 0.0
Remove the reset_index so you keep the DatetimeIndex, because reindex matches on the index; with the default RangeIndex nothing matches and you get only 0 values:
weekly_data = (df.groupby([pd.Grouper(key='date', freq='W-SUN')])[data_type]
                 .sum()
)
Then it is possible to use the fill_value=0 parameter and finally add reset_index:
r = pd.date_range('2020-08-30', '2020-10-30', freq='W-SUN', name='date')
weekly_data = weekly_data.reindex(r, fill_value=0).reset_index()
print (weekly_data)
date sum
0 2020-08-30 0
1 2020-09-06 0
2 2020-09-13 0
3 2020-09-20 0
4 2020-09-27 0
5 2020-10-04 0
6 2020-10-11 78
7 2020-10-18 673
8 2020-10-25 0
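As a side note, the reason the original attempt zeroed out the existing values is that reindex matches on index labels; after reset_index the index is a plain RangeIndex (0, 1, ...), so no date label matches and every row becomes NaN, then 0. A tiny illustration with the values from the example:
s = pd.Series([78, 673], index=[0, 1])  # RangeIndex, as after reset_index
r = pd.date_range('2020-10-11', periods=2, freq='W-SUN')
print(s.reindex(r))  # both values are NaN: no date label equals 0 or 1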

Add Missing Date Index in a multiindex dataframe

I am working with a multi index data frame that has a date column and location_id as indices.
import numpy as np
import pandas as pd

index_1 = ['2020-01-01', '2020-01-03', '2020-01-04']
index_2 = [100, 200, 300]
index = pd.MultiIndex.from_product([index_1, index_2],
                                   names=['Date', 'location_id'])
df = pd.DataFrame(np.random.randint(10, 100, 9), index)
df
0
Date location_id
2020-01-01 100 19
200 75
300 39
2020-01-03 100 11
200 91
300 80
2020-01-04 100 36
200 56
300 54
I want to fill in the missing dates with just one location_id each, filled with 0:
0
Date location_id
2020-01-01 100 19
200 75
300 39
2020-01-02 100 0
2020-01-03 100 11
200 91
300 80
2020-01-04 100 36
200 56
300 54
How can I achieve that? This is helpful but only if my data frame was not multi indexed.
You can get the unique values of the Date index level, generate all dates between the min and max with pd.date_range, and take the difference with the unique values to get the missing ones. Then reindex df with the union of the original index and a MultiIndex.from_product built from the missing dates and the min of the location_id level.
# unique dates
m = df.index.unique(level=0)
# reindex
df = df.reindex(df.index.union(
        pd.MultiIndex.from_product([pd.date_range(m.min(), m.max())
                                      .difference(pd.to_datetime(m))
                                      .strftime('%Y-%m-%d'),
                                    [df.index.get_level_values(1).min()]])),
                fill_value=0)
print(df)
0
2020-01-01 100 91
200 49
300 19
2020-01-02 100 0
2020-01-03 100 41
200 25
300 51
2020-01-04 100 44
200 40
300 54
Instead of pd.MultiIndex.from_product, you can also use product from itertools. Same result, but maybe faster.
from itertools import product
df = df.reindex(df.index.union(
        list(product(pd.date_range(m.min(), m.max())
                       .difference(pd.to_datetime(m))
                       .strftime('%Y-%m-%d'),
                     [df.index.get_level_values(1).min()]))),
                fill_value=0)
A pandas index is immutable, so you need to construct a new one. Move the index level location_id to a column, keep only the unique dates, and call asfreq to create rows for the missing dates; assign the result to df2. Finally, use df.align to join both indices and fillna:
df1 = df.reset_index(-1)
df2 = df1.loc[~df1.index.duplicated()].asfreq('D').ffill()
df_final = df.align(df2.set_index('location_id', append=True))[0].fillna(0)
Out[75]:
0
Date location_id
2020-01-01 100 19.0
200 75.0
300 39.0
2020-01-02 100 0.0
2020-01-03 100 11.0
200 91.0
300 80.0
2020-01-04 100 36.0
200 56.0
300 54.0
unstack/stack and asfreq/reindex would also work (note that this fills the missing date for every location_id, not just one):
new_df = df.unstack(fill_value=0)
new_df.index = pd.to_datetime(new_df.index)
new_df.asfreq('D').fillna(0).stack('location_id')
Output:
0
Date location_id
2020-01-01 100 78.0
200 25.0
300 89.0
2020-01-02 100 0.0
200 0.0
300 0.0
2020-01-03 100 79.0
200 23.0
300 11.0
2020-01-04 100 30.0
200 79.0
300 72.0

pandas: calculate column value means on groups and means across the whole dataframe

I have a df, where df['period'] = (df['date1'] - df['date2']) / np.timedelta64(1, 'D'):
code y_m date1 date2 period
1000 201701 2017-12-10 2017-12-09 1
1000 201701 2017-12-14 2017-12-12 2
1000 201702 2017-12-15 2017-12-13 2
1000 201702 2017-12-17 2017-12-15 2
2000 201701 2017-12-19 2017-12-18 1
2000 201701 2017-12-12 2017-12-10 2
2000 201702 2017-12-11 2017-12-10 1
2000 201702 2017-12-13 2017-12-12 1
2000 201702 2017-12-11 2017-12-10 1
Then I groupby code and y_m to calculate the average of date1 - date2:
df_avg_period = df.groupby(['code', 'y_m'])['period'].mean().reset_index(name='avg_period')
code y_m avg_period
1000 201701 1.5
1000 201702 2
2000 201701 1.5
2000 201702 1
but I like to convert df_avg_period into a matrix that transposes column code to rows and y_m to columns, like
0 1 2 3
0 -1 0 201701 201702
1 0 1.44 1.44 1.4
2 1000 1.75 1.5 2
3 2000 1.20 1.5 1
-1 is a dummy value that indicates either that a value doesn't exist for a specific code/y_m cell or that it is only there to maintain the matrix shape; 0 represents 'all' values, i.e. the average over code, over y_m, or over both. E.g. cell (1,1) averages the period values of all rows in df; (1,2) averages period for 201701 across all rows that have this y_m value in df.
Apparently pivot_table cannot give the correct results using mean, so I am wondering how to achieve this correctly?
Use pivot_table with margins=True:
piv = df.pivot_table(
    index='code', columns='y_m', values='period', aggfunc='mean', margins=True
)
# housekeeping
(piv.reset_index()
    .rename_axis(None, axis=1)
    .rename({'code': -1, 'All': 0}, axis=1)
    .sort_index(axis=1)
)
-1 0 201701 201702
0 1000 1.750000 1.5 2.0
1 2000 1.200000 1.5 1.0
2 All 1.444444 1.5 1.4
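To get a bit closer to the exact matrix layout in the question (the 'All' label replaced by 0 and the overall row on top), a hedged follow-up sketch on top of piv (it assumes y_m is numeric, as in the output above):
out = (piv.reset_index()
          .rename_axis(None, axis=1)
          .rename({'code': -1, 'All': 0}, axis=1)
          .sort_index(axis=1))
out[-1] = out[-1].replace('All', 0)               # row label 'All' -> 0
out = out.sort_values(-1).reset_index(drop=True)  # puts the overall (0) row first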

finding intersection of intervals in pandas

I have two dataframes
df_a=
Start Stop Value
0 0 100 0.0
1 101 200 1.0
2 201 1000 0.0
df_b=
Start Stop Value
0 0 50 0.0
1 51 300 1.0
2 301 1000 0.0
I would like to generate a DataFrame which contains the intervals, identified by Start and Stop, over which Value is constant in both df_a and df_b. For each interval I would like to store whether the Value was the same, and what the value was in df_a and in df_b.
Desired output:
df_out=
Start Stop SameValue Value_dfA Value_dfB
0 50 1 0 0
51 100 0 0 1
101 200 1 1 1
201 300 0 0 1
[...]
Not sure if this is the best way to do this, but you can reindex, join, groupby and agg to get your intervals, e.g.:
Expand each df so that the index is every single value of the range (Start to Stop) using reindex() and padding the values:
In []:
df_a_expanded = df_a.set_index('Start').reindex(range(max(df_a['Stop'])+1)).fillna(method='pad')
df_a_expanded
Out[]:
Stop Value
Start
0 100.0 0.0
1 100.0 0.0
2 100.0 0.0
3 100.0 0.0
4 100.0 0.0
...
997 1000.0 0.0
998 1000.0 0.0
999 1000.0 0.0
1000 1000.0 0.0
[1001 rows x 2 columns]
In []:
df_b_expanded = df_b.set_index('Start').reindex(range(max(df_b['Stop'])+1)).fillna(method='pad')
Join the two expanded dfs:
In []:
df = df_a_expanded.join(df_b_expanded, lsuffix='_dfA', rsuffix='_dfB').reset_index()
df
Out[]:
Start Stop_dfA Value_dfA Stop_dfB Value_dfB
0 0 100.0 0.0 50.0 0.0
1 1 100.0 0.0 50.0 0.0
2 2 100.0 0.0 50.0 0.0
3 3 100.0 0.0 50.0 0.0
4 4 100.0 0.0 50.0 0.0
...
Note: you can ignore the Stop columns and could have dropped them in the previous step.
There is no standard way to groupby only consecutive values (à la itertools.groupby), so resorting to a cumsum() hack:
In []:
groups = (df[['Value_dfA', 'Value_dfB']] != df[['Value_dfA', 'Value_dfB']].shift()).any(axis=1).cumsum()
g = df.groupby([groups, 'Value_dfA', 'Value_dfB'], as_index=False)
Now you can get the result you want by aggregating the group with min, max:
In []:
df_out = g['Start'].agg({'Start': 'min', 'Stop': 'max'})
df_out
Out[]:
Value_dfA Value_dfB Start Stop
0 0.0 0.0 0 50
1 0.0 1.0 51 100
2 1.0 1.0 101 200
3 0.0 1.0 201 300
4 0.0 0.0 301 1000
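Note that on pandas 1.0+ the dict-renaming form of agg above raises a SpecificationError; a hedged equivalent using named aggregation (same grouping keys) would be:
df_out = (df.groupby([groups, 'Value_dfA', 'Value_dfB'])
            .agg(Start=('Start', 'min'), Stop=('Stop', 'max'))
            .reset_index(['Value_dfA', 'Value_dfB'])
            .reset_index(drop=True))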
Now you just have to add the SameValue column and, if desired, order the columns to get the exact output you want:
In []:
df_out['SameValue'] = (df_out['Value_dfA'] == df_out['Value_dfB'])*1
df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]
Out[]:
Start Stop SameValue Value_dfA Value_dfB
0 0 50 1 0.0 0.0
1 51 100 0 0.0 1.0
2 101 200 1 1.0 1.0
3 201 300 0 0.0 1.0
4 301 1000 1 0.0 0.0
This assumes the ranges of the two dataframes are the same, or you will need to handle the NaNs you will get with the join().
I found a way but not sure it is the most efficient. You have the input data:
import pandas as pd
dfa = pd.DataFrame({'Start': [0, 101, 201], 'Stop': [100, 200, 1000], 'Value': [0., 1., 0.]})
dfb = pd.DataFrame({'Start': [0, 51, 301], 'Stop': [50, 300, 1000], 'Value': [0., 1., 0.]})
First I would create the columns Start and Stop of df_out with:
df_out = pd.DataFrame({'Start': sorted(set(dfa['Start']) | set(dfb['Start'])),
                       'Stop': sorted(set(dfa['Stop']) | set(dfb['Stop']))})
Then, to get the value of dfa (and dfb) associated with the right (Start, Stop) range in a column named Value_dfA (and Value_dfB), I would do:
df_out['Value_dfA'] = df_out['Start'].apply(lambda x: dfa['Value'][dfa['Start'] <= x].iloc[-1])
df_out['Value_dfB'] = df_out['Start'].apply(lambda x: dfb['Value'][dfb['Start'] <= x].iloc[-1])
To get the column SameValue, do:
df_out['SameValue'] = df_out.apply(lambda x: 1 if x['Value_dfA'] == x['Value_dfB'] else 0,axis=1)
If it matters, you can reorder the columns with:
df_out = df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]
Your output is then
Start Stop SameValue Value_dfA Value_dfB
0 0 50 1 0.0 0.0
1 51 100 0 0.0 1.0
2 101 200 1 1.0 1.0
3 201 300 0 0.0 1.0
4 301 1000 1 0.0 0.0
I have an O(n log n) solution, where n is the sum of the rows of df_a and df_b. Here's how it goes:
Rename the value column of both dataframes to value_a and value_b respectively. Next, append df_b to df_a:
df = pd.concat([df_a, df_b])  # DataFrame.append was removed in pandas 2.0
Sort the df with respect to start column.
df = df.sort_values('start')
Resulting dataframe will look like this:
start stop value_a value_b
0 0 100 0.0 NaN
0 0 50 NaN 0.0
1 51 300 NaN 1.0
1 101 200 1.0 NaN
2 201 1000 0.0 NaN
2 301 1000 NaN 0.0
Forward fill the missing values:
df = df.fillna(method='ffill')
Compute same_value column:
df['same_value'] = df['value_a'] == df['value_b']
Recompute stop column:
df.stop = df.start.shift(-1)
You will get the dataframe you desire (except for the first and last rows, which are pretty easy to fix; see the sketch after the output below):
start stop value_a value_b same_value
0 0 0.0 0.0 NaN False
0 0 51.0 0.0 0.0 True
1 51 101.0 0.0 1.0 False
1 101 201.0 1.0 1.0 True
2 201 301.0 0.0 1.0 False
2 301 NaN 0.0 0.0 True
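The first/last-row fix alluded to above might look like this, done in place of the "Recompute stop column" step (a sketch; it assumes the lowercase column names used in this answer):
last_stop = df['stop'].max()                   # overall end of the range, grabbed before 'stop' is overwritten
df['stop'] = df['start'].shift(-1)
df['stop'] = df['stop'].fillna(last_stop)      # close the final interval
df = df.dropna(subset=['value_a', 'value_b'])  # drop the first row, where one side is still NaN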
Here is an answer which computes the overlapping intervals really quickly (which answers the question in the title):
from io import StringIO
import pandas as pd
from ncls import NCLS
c1 = StringIO("""Start Stop Value
0 100 0.0
101 200 1.0
201 1000 0.0""")
c2 = StringIO("""Start Stop Value
0 50 0.0
51 300 1.0
301 1000 0.0""")
df1 = pd.read_table(c1, sep=r"\s+")
df2 = pd.read_table(c2, sep=r"\s+")
ncls = NCLS(df1.Start.values, df1.Stop.values, df1.index.values)
x1, x2 = ncls.all_overlaps_both(df2.Start.values, df2.Stop.values, df2.index.values)
df1 = df1.reindex(x2).reset_index(drop=True)
df2 = df2.reindex(x1).reset_index(drop=True)
# print(df1)
# print(df2)
df = df1.join(df2, rsuffix="2")
print(df)
# Start Stop Value Start2 Stop2 Value2
# 0 0 100 0.0 0 50 0.0
# 1 0 100 0.0 51 300 1.0
# 2 101 200 1.0 51 300 1.0
# 3 201 1000 0.0 51 300 1.0
# 4 201 1000 0.0 301 1000 0.0
With this final df it should be simple to get to the result you need (but it is left as an exercise for the reader).
See NCLS for the interval overlap data structure.
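For what it's worth, one hedged way to finish from the joined frame printed above (column names as in that output):
df_out = pd.DataFrame({
    'Start': df[['Start', 'Start2']].max(axis=1),   # an overlap begins at the later of the two starts
    'Stop': df[['Stop', 'Stop2']].min(axis=1),      # and ends at the earlier of the two stops
    'Value_dfA': df['Value'],
    'Value_dfB': df['Value2'],
})
df_out['SameValue'] = (df_out['Value_dfA'] == df_out['Value_dfB']).astype(int)
print(df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']])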
