Pandas - multiple new columns based on cumprod and condition in other column - python

I have a dataframe df with two columns, df["Period"] and df["Returns"]. df["Period"] holds increasing integers 1, 2, 3, ... n. I want to calculate new columns as the .cumprod of df["Returns"] over the rows where df["Period"] >= 1, 2, 3, etc. Note that the number of rows for each unique period is different and not systematic.
So I get n new columns:
df["M_1"]: cumprod of df["Returns"] for rows where df["Period"] >= 1
df["M_2"]: cumprod of df["Returns"] for rows where df["Period"] >= 2
...
Below is my example, which works. The implementation has two drawbacks:
- it is very slow for a large number of unique periods
- it does not work well with pandas method chaining
Any hint on how to speed this up and/or vectorize it is appreciated.
import numpy as np
import pandas as pd

# Create sample data
n = 10
data = {
    "Period": np.sort(np.random.randint(1, 5, n)),
    "Returns": np.random.randn(n) / 100,
}
df = pd.DataFrame(data)

# Slow implementation
periods = set(df["Period"])
for period in periods:
    cumret = (1 + df.query("Period >= @period")["Returns"]).cumprod() - 1
    df[f"M_{period}"] = cumret
df.head()
This is the expected output:
   Period       Returns          M_1          M_2          M_3         M_4
0       1  -0.0268917    -0.0268917           nan          nan         nan
1       1   0.018205     -0.00917625          nan          nan         nan
2       2   0.00505662   -0.00416604   0.00505662          nan         nan
3       2  -8.28544e-05  -0.00424855   0.00497334          nan         nan
4       2   0.00127519   -0.00297878   0.00625488          nan         nan
5       3  -0.00224315   -0.00521524   0.0039977   -0.00224315         nan
6       3  -0.0197291    -0.0248414   -0.0158103   -0.021928           nan
7       3   0.00136592   -0.0235094   -0.0144659   -0.020592           nan
8       4   0.00582897   -0.0178175   -0.00872129  -0.0148831   0.00582897
9       4   0.00260425   -0.0152597   -0.00613975  -0.0123176   0.0084484

Here is how your code performs on my machine (Python 3.10.7, Pandas 1.4.3) on average over 10,000 iterations:
import statistics
import time

import numpy as np
import pandas as pd

elapsed_time = []
for _ in range(10_000):
    start_time = time.time()
    periods = set(df["Period"])
    for period in periods:
        cumret = (1 + df.query("Period >= @period")["Returns"]).cumprod() - 1
        df[f"M_{period}"] = cumret
    elapsed_time.append(time.time() - start_time)

print(f"--- {round(statistics.mean(elapsed_time), 6):2} seconds in average ---")
print(df)
Output:
--- 0.00298 seconds in average ---
Period Returns M_1 M_2 M_4
0 1 -0.008427 -0.008427 NaN NaN
1 1 0.019699 0.011106 NaN NaN
2 2 0.012661 0.023908 0.012661 NaN
3 2 -0.005059 0.018728 0.007538 NaN
4 4 0.025452 0.044657 0.033182 0.025452
5 4 0.010808 0.055948 0.044349 0.036535
6 4 0.004843 0.061062 0.049407 0.041555
7 4 0.005791 0.067207 0.055484 0.047587
8 4 -0.001816 0.065269 0.053568 0.045685
9 4 0.014102 0.080291 0.068425 0.060431
With some minor modifications, you can get a ~3x speed improvement:
elapsed_time = []
for _ in range(10_000):
    start_time = time.time()
    for period in df["Period"].unique():
        df[f"M_{period}"] = (
            1 + df.loc[df["Period"].ge(period), "Returns"]
        ).cumprod() - 1
    elapsed_time.append(time.time() - start_time)

print(f"--- {round(statistics.mean(elapsed_time), 6):2} seconds in average ---")
print(df)
Output:
--- 0.001052 seconds in average ---
Period Returns M_1 M_2 M_4
0 1 -0.008427 -0.008427 NaN NaN
1 1 0.019699 0.011106 NaN NaN
2 2 0.012661 0.023908 0.012661 NaN
3 2 -0.005059 0.018728 0.007538 NaN
4 4 0.025452 0.044657 0.033182 0.025452
5 4 0.010808 0.055948 0.044349 0.036535
6 4 0.004843 0.061062 0.049407 0.041555
7 4 0.005791 0.067207 0.055484 0.047587
8 4 -0.001816 0.065269 0.053568 0.045685
9 4 0.014102 0.080291 0.068425 0.060431
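As for the second drawback (method chaining), here is a minimal sketch of folding the same ~3x variant into a chain. It assumes the "Period"/"Returns" column names from the question; the helper name add_cum_returns is just for illustration:
def add_cum_returns(df):
    # Build all M_{period} columns in one dict comprehension and splat them
    # into assign, so the step can sit inside a method chain via .pipe().
    return df.assign(
        **{
            f"M_{period}": (1 + df.loc[df["Period"].ge(period), "Returns"]).cumprod() - 1
            for period in df["Period"].unique()
        }
    )

result = df.pipe(add_cum_returns)  # e.g. df.pipe(add_cum_returns).head()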

Related

pandas bfill by interval to correct missing/invalid entries

So I have a dataframe:
import numpy
import pandas

df = pandas.DataFrame(
    [[numpy.nan, 5], [numpy.nan, 5], [2015, 5], [2020, 5],
     [numpy.nan, 10], [numpy.nan, 10], [numpy.nan, 10], [2090, 10], [2100, 10]],
    columns=["value", "interval"],
)
value interval
0 NaN 5
1 NaN 5
2 2015.0 5
3 2020.0 5
4 NaN 10
5 NaN 10
6 NaN 10
7 2090.0 10
8 2100.0 10
I need to backward-fill the NaN values based on their interval and the first non-NaN value following that index, so the expected output is:
value interval
0 2005.0 5 # corrected 2010 - 5(interval)
1 2010.0 5 # corrected 2015 - 5(interval)
2 2015.0 5 # no change ( use this to correct 2 previous rows)
3 2020.0 5 # no change
4 2060.0 10 # corrected 2070 - 10
5 2070.0 10 # corrected 2080 - 10
6 2080.0 10 # corrected 2090 - 10
7 2090.0 10 # no change (use this to correct 3 previous rows)
8 2100.0 10 # no change
I am at a loss as to how I can accomplish this task using pandas/numpy vectorized operations ...
I can do it with a pretty simple loop:
last_good_value = None
fixed_values = []
for val, interval in reversed(df.values):
    if numpy.isnan(val) and last_good_value is not None:
        fixed_values.append(last_good_value - interval)
        last_good_value = fixed_values[-1]
    else:
        fixed_values.append(val)
        if not numpy.isnan(val):
            last_good_value = val
print(list(reversed(fixed_values)))
which strictly speaking works... but I would like to understand a pandas solution that can resolve the values and avoid the loop (this is quite a big list in reality).
First, get the position of the rows within the groups sharing the same 'interval' value.
Then, get the last value of each group.
What you are looking for is "last_value - pos * interval":
df = df.reset_index()
grouped_df = df.groupby(['interval'])
df['pos'] = grouped_df['index'].rank(method='first', ascending=False) - 1
df['last'] = grouped_df['value'].transform('last')
df['value'] = df['last'] - df['interval'] * df['pos']
del df['pos'], df['last'], df['index']
Create a grouping Series that groups the last non-null value with all NaN rows before it, by reversing with [::-1]. Then you can bfill and use cumsum to determine how much to subtract off of every row.
s = df['value'].notnull()[::-1].cumsum()
subt = df.loc[df['value'].isnull(), 'interval'][::-1].groupby(s).cumsum()
df['value'] = df.groupby(s)['value'].bfill().subtract(subt, fill_value=0)
value interval
0 2005.0 5
1 2010.0 5
2 2015.0 5
3 2020.0 5
4 2060.0 10
5 2070.0 10
6 2080.0 10
7 2090.0 10
8 2100.0 10
Because subt is restricted to only the NaN rows, fill_value=0 ensures that rows which already have values remain unchanged:
print(subt)
#6 10
#5 20
#4 30
#1 5
#0 10
#Name: interval, dtype: int64

What is the most efficient way to check several conditions in columns in a pandas dataframe?

I am working through a pandas dataframe with three relevant columns and 2.7 million rows. The structure is:
key VisitLink dx_filter time
0 1 ddcde14 1 100
1 2 abcde11 1 140
2 3 absdf12 1 50
3 4 ddcde14 0 125
4 5 ddcde14 1 140
data = [[1,'ddcde14',1,100],[2,'abcde11',1,140],[3,'absdf12',1,50],[4,'ddcde14',0,125],[5,'ddcde14',1,140]]
df_example = pd.DataFrame(data,columns = ['key','VisitLink','dx_filter','time'])
I need 3 things to be true:
- VisitLink: matches between the two rows
- dx_filter: is 1 for the first event
- Time: the second event happens within 30 days of the first event
Example: Key 1 will generate Key 4 as a matching record, as it meets all qualifications, but Key 4 will not generate Key 5 because its dx_filter = 0.
I ran a trial where I predicted my method would take 120+ hours to complete and am wondering if there is a way to shorten this to <10 hours or if that is not possible.
def add_readmit_id(df):
    df['readmit_id'] = np.nan
    def set_id(row):
        if row['dx_filter'] == 0:
            return np.nan
        else:
            relevant_df = df.loc[df['VisitLink'] == row['VisitLink']]
            timeframe_df = relevant_df.loc[(relevant_df['time'] > row['time']) & (relevant_df['time'] <= row['time'] + 30)]
            next_timeframe = timeframe_df['time'].min()
            id_row = timeframe_df.loc[timeframe_df['time'] == next_timeframe]
            if not id_row.empty:
                return id_row.iloc[0]['key']
            else:
                return np.nan
    df['readmit_id'] = df.apply(set_id, axis=1)
    return df
df_example = add_readmit_id(df_example)
See above for the code I used to run it (minimal reproducible example).
Here's my approach with groupby:
groups = df.groupby('VisitLink')
s = groups['time'].diff(-1).le(30) & df['dx_filter']
df['shifted'] = np.where(s, groups['key'].shift(-1), np.nan)
Output:
key VisitLink dx_filter time shifted
0 1 ddcde14 1 100 4.0
1 2 abcde11 1 140 NaN
2 3 absdf12 1 50 NaN
3 4 ddcde14 0 125 NaN
4 5 ddcde14 1 140 NaN

calculate elapsed time from repeating time series samples (laps)

I have data that is essentially a series of laps where each lap has its own elapsed time, but I am trying to calculate the total elapsed time.
Here's some code that has similar data:
import pandas as pd
import numpy as np

laptime = pd.Series([1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
lap = pd.Series([1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3])
timeblocks = pd.DataFrame({'laptime': laptime, 'lap': lap})
timeblocks['timediff'] = timeblocks.laptime.diff()
timeblocks['elapsed'] = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
timeblocks
The resulting data looks like:
lap laptime timediff elapsed
0 1 1 NaN 1
1 1 2 1.0 2
2 1 3 1.0 3
3 1 4 1.0 4
4 1 5 1.0 5
5 2 1 -4.0 6
6 2 2 1.0 7
7 2 3 1.0 8
8 2 4 1.0 9
9 2 5 1.0 10
10 3 1 -4.0 11
11 3 2 1.0 12
12 3 3 1.0 13
13 3 4 1.0 14
14 3 5 1.0 15
The elapsed time is what I actually need to calculate. I tried various forms of messing around with the time differentials and cumsum, but am kinda stuck.
Real-world data looks more like the following:
113.81201171875 1
113.86206054688 1
113.912109375 1
113.96215820313 1
0.05126953125 2
0.101318359375 2
0.1513671875 2
In the case of the real world data, the sample rate is about 0.05 sec.
import io, operator, itertools
Assuming the data is in a text file or multiline string:
s = '''113.81201171875 1
113.86206054688 1
113.912109375 1
113.96215820313 1
0.05126953125 2
0.101318359375 2
0.1513671875 2'''
f = io.StringIO(s)
Gather the data into a list; sort the list by lap then time; group the data by lap and extract the largest and smallest time; calculate the elapsed lap time; accumulate.
data = []
for line in f:
    time, lap = map(float, line.strip().split())
    data.append((time, lap))

lap = operator.itemgetter(1)
time = operator.itemgetter(0)
data.sort(key=operator.itemgetter(1, 0))

total = 0
for el, times in itertools.groupby(data, lap):
    low, *_, high = map(time, times)
    elapsed = high - low
    print(f'lap {el}, elapsed time: {elapsed}')
    total += elapsed
print(f'total elapsed time: {total}')
>>>
lap 1.0, elapsed time: 0.15014648438000222
lap 2.0, elapsed time: 0.10009765625
total elapsed time: 0.2502441406300022
>>>
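For reference, a rough pandas sketch of the same idea, assuming the 'laptime'/'lap' columns from the toy frame above and assuming the missing step at each lap start equals the sampling interval (1 in the toy data, roughly 0.05 s in the real data):
import pandas as pd

laptime = pd.Series([1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
lap = pd.Series([1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3])
timeblocks = pd.DataFrame({'laptime': laptime, 'lap': lap})

# Per-lap differences so the reset between laps does not create a large
# negative step, then fill the gap at each lap start with the assumed
# sampling interval and accumulate.
step = 1  # ~0.05 for the real-world data
per_lap_diff = timeblocks.groupby('lap')['laptime'].diff().fillna(step)
timeblocks['elapsed'] = per_lap_diff.cumsum()
print(timeblocks)  # reproduces the 1..15 'elapsed' column from the question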

Fill the selected row so that they have equal numbers of non-NaN value

My flowchart is illustrated above. I want to take 2 rows from the original dataset and copy them to another one (because I don't want to modify the original data). In the new dataset, check whether the 2 rows have the same number of non-NaN values (df.iloc[i,:].count()); if not, fill the difference with zeros, and then continue to perform the operation.
Example:
Original Data:
3 5 5 NaN NaN NaN
1 4 7 5 NaN NaN
NaN NaN 3 6 7 NaN
NaN 3 8 4 11 NaN
3 0 3 7 2 1
Take rows i and i+1 and copy them to another dataset:
3 5 5 NaN NaN NaN
1 4 7 5 NaN NaN
Because df.iloc[i+1,:].count() != df.iloc[i,:].count(), the row with more NaN values must be filled like this:
3 5 5 0 NaN NaN
1 4 7 5 NaN NaN
In the case of rows 3 and 4:
NaN 0 3 6 7 NaN
NaN 3 8 4 11 NaN
And then perform the operation.
Here is my code:
for i in range():
    process[1,:] = df.iloc[i,:]
    process[2,:] = df.iloc[i+1,:]
    while True:
        if process[1,:].count() == process[2,:].count():
            break
        else:
            if process[1,:].count() > process[2,:].count():
                process[2,:] = process[2,:].fillna(value = 0, limit = process[1,:].count() - process[2,:].count())
            else:
                process[2,:] = process[2,:].fillna(value = 0, limit = process[2,:].count() - process[1,:].count())
    A[i,:] = stats.ttest_rel(process[1,:].values, process[2,:].values)  # this line is just for the statistical test, you can ignore it
    i += 1
My algorithm didn't work, and it feels clumsy to check row after row over and over again.
Any suggestions and corrections are welcome, thank you very much.
P.S.: I want to consecutively perform a statistical test of every row against the next, so before doing so, I need to make them have equal numbers of non-NaN values.
Finally, I came up with the answer and I want to share it here:
for i in range(len(df) - 1):
    process = pd.DataFrame(columns=df.columns)
    process = process.append(df.iloc[i,:], ignore_index=True)
    process = process.append(df.iloc[i+1,:], ignore_index=True)
    while True:
        if process.iloc[0,:].count() == process.iloc[1,:].count():
            break
        else:
            if process.iloc[0,:].count() > process.iloc[1,:].count():
                process.iloc[1,:] = process.iloc[1,:].fillna(value=0, limit=process.iloc[0,:].count() - process.iloc[1,:].count())
            else:
                process.iloc[0,:] = process.iloc[0,:].fillna(value=0, limit=process.iloc[1,:].count() - process.iloc[0,:].count())
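For comparison, here is a sketch without the while loop, since a single fillna call with limit set to the difference in non-NaN counts already covers the whole gap. It assumes the 5x6 sample frame above and the scipy.stats.ttest_rel test mentioned in the question; the dropna before the test is an extra assumption so both inputs have equal length:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame([
    [3, 5, 5, np.nan, np.nan, np.nan],
    [1, 4, 7, 5, np.nan, np.nan],
    [np.nan, np.nan, 3, 6, 7, np.nan],
    [np.nan, 3, 8, 4, 11, np.nan],
    [3, 0, 3, 7, 2, 1],
])

results = []
for i in range(len(df) - 1):
    a, b = df.iloc[i].copy(), df.iloc[i + 1].copy()
    diff = a.count() - b.count()     # difference in non-NaN counts
    if diff > 0:
        b = b.fillna(0, limit=diff)  # pad the row with more NaNs
    elif diff < 0:
        a = a.fillna(0, limit=-diff)
    results.append(stats.ttest_rel(a.dropna(), b.dropna()))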

Pandas shifted value calculation

I'd like to create a new column containing values calculated from shifted values in other columns.
As you can see in the code below, I first created some time series data.
'price' is randomly generated time series data, and 'momentum' is the average momentum value over the most recent 12 periods.
I'd like to add a new column containing the average momentum over 'n' periods, where 'n' corresponds to the value in df['shift'] for that row, rather than the fixed 12 used in the momentum function.
How can I do this?
(In the example below, momentum was calculated with a fixed 12.)
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.uniform(low=0.8, high=1.3, size=100).cumprod(), columns=['price'])
df['shift'] = np.random.randint(5, size=100) + 3
df

def momentum(x):
    init = 0
    for i in range(1, 13):
        init = x.price / x.price.shift(i) + init
    return init / 12

df['momentum'] = momentum(df)
price shift momentum
0 1.069857 3 NaN
1 0.986563 7 NaN
2 0.809052 5 NaN
3 0.991204 3 NaN
4 0.846159 6 NaN
5 0.717344 4 NaN
6 0.599436 3 NaN
7 0.596711 7 NaN
8 0.543450 4 NaN
9 0.511640 3 NaN
10 0.496865 3 NaN
11 0.460142 4 NaN
12 0.435862 4 0.657192
13 0.410519 4 0.665493
14 0.368428 5 0.640927
15 0.335583 7 0.625128
16 0.313470 7 0.635423
17 0.321265 4 0.704990
18 0.319503 7 0.746885
19 0.365991 4 0.900135
20 0.300793 4 0.766266
21 0.274449 6 0.733104
This is my approach
def momentum(shift, price, array, index):
    if shift > index:
        return 0
    else:
        init = 0
        for i in range(1, int(shift) + 1):
            init += price / array[int(index) - i]
        return init

df = pd.DataFrame(np.random.uniform(low=0.8, high=1.3, size=100).cumprod(), columns=['price'])
df['shift'] = np.random.randint(5, size=100) + 3
df['Index'] = df.index
series = df['price'].tolist()
df['momentum'] = df.apply(lambda row: momentum(row['shift'], row['price'], series, row['Index']), axis=1)
print(df)
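A hedged alternative sketch (assuming the same randomly generated 'price' and 'shift' columns): precompute one column of price ratios per lag up front, then take the mean of the first 'shift' lags per row. Use .sum() instead of .mean() for the un-averaged convention of the answer above; rows with fewer observations than 'shift' come out as NaN rather than 0.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(low=0.8, high=1.3, size=100).cumprod(), columns=['price'])
df['shift'] = np.random.randint(5, size=100) + 3

# One column of ratios price_t / price_{t-i} per lag i.
max_shift = int(df['shift'].max())
ratios = pd.DataFrame({i: df['price'] / df['price'].shift(i) for i in range(1, max_shift + 1)})

# Average the first n lags for each row, where n is that row's 'shift' value.
df['momentum'] = [
    ratios.iloc[row, :n].mean() if row >= n else np.nan
    for row, n in enumerate(df['shift'])
]
print(df.head(10))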
