I'd like to create a new column containing values calculated from shifted values in other columns.
As you can see in the code below, I first created some time series data:
'price' is a randomly generated time series, and 'momentum' is the average momentum over the 12 most recent periods.
I'd like to add a new column containing the average momentum over the most recent 'n' periods, where 'n' is taken from df['shift'] for each row, rather than being fixed at 12 as in the momentum function.
How can I do this?
(In the example below, momentum was calculated with a fixed window of 12.)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.uniform(low=0.8, high=1.3, size=100).cumprod(),columns = ['price'])
df['shift'] = np.random.randint(5, size=100)+3
df
def momentum(x):
    init = 0
    for i in range(1, 13):
        init = x.price / x.price.shift(i) + init
    return init / 12
df['momentum'] = momentum(df)
price shift momentum
0 1.069857 3 NaN
1 0.986563 7 NaN
2 0.809052 5 NaN
3 0.991204 3 NaN
4 0.846159 6 NaN
5 0.717344 4 NaN
6 0.599436 3 NaN
7 0.596711 7 NaN
8 0.543450 4 NaN
9 0.511640 3 NaN
10 0.496865 3 NaN
11 0.460142 4 NaN
12 0.435862 4 0.657192
13 0.410519 4 0.665493
14 0.368428 5 0.640927
15 0.335583 7 0.625128
16 0.313470 7 0.635423
17 0.321265 4 0.704990
18 0.319503 7 0.746885
19 0.365991 4 0.900135
20 0.300793 4 0.766266
21 0.274449 6 0.733104
This is my approach:
def momentum(shift, price, array, index):
    if shift > index:
        return 0
    else:
        init = 0
        for i in range(1, int(shift) + 1):
            init += price / array[int(index) - i]
        return init
df = pd.DataFrame(np.random.uniform(low=0.8, high=1.3, size=100).cumprod(),columns = ['price'])
df['shift'] = np.random.randint(5, size=100)+3
df['Index'] = df.index
series = df['price'].tolist()
df['momentum'] = df.apply(lambda row: momentum(row['shift'],row['price'],series,row['Index']),axis=1)
print(df)
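For reference, here is a minimal sketch (not from the original post) of one way to make the window length follow the 'shift' column: loop over the positions and average each row's price ratios over its own n past periods. The division by n mirrors the / 12 in the fixed-window version; drop it if you want the plain sum as in the approach above.
import numpy as np

def variable_momentum(frame):
    # For each row t, average price[t] / price[t - i] for i = 1..n, where n = frame['shift'][t].
    out = np.full(len(frame), np.nan)
    prices = frame['price'].to_numpy()
    shifts = frame['shift'].to_numpy()
    for t in range(len(frame)):
        n = int(shifts[t])
        if t >= n:  # need n past prices
            out[t] = np.mean([prices[t] / prices[t - i] for i in range(1, n + 1)])
    return out

df['momentum_var'] = variable_momentum(df)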
I have a dataframe df with two columns, df["Period"] and df["Returns"]. df["Period"] holds numbers 1, 2, 3, ... n and is non-decreasing. I want to calculate new columns using the .cumprod() of df["Returns"] for rows where df["Period"] >= 1, 2, 3, etc. Note that the number of rows for each unique period differs and is not systematic.
So I get n new columns
df["M_1]: is cumprod of df["Return"] for rows df["Period"] >= 1
df["M_2]: is cumprod of df["Return"] for rows df["Period"] >= 2
...
Below is my example, which works. The implementation has two drawbacks:
it is very slow for a large number of unique periods
it does not work well with pandas method chaining
Any hint on how to speed this up and/or vectorize it is appreciated.
import numpy as np
import pandas as pd
# Create sample data
n = 10
data = {"Period": np.sort(np.random.randint(1,5,n)),
"Returns": np.random.randn(n)/100, }
df = pd.DataFrame(data)
# Slow implementation
periods = set(df["Period"])
for period in periods:
    cumret = (1 + df.query("Period >= @period")["Returns"]).cumprod() - 1
    df[f"M_{period}"] = cumret
df.head()
This is the expected output:
   Period       Returns           M_1           M_2           M_3           M_4
0       1  -0.0268917    -0.0268917            NaN           NaN           NaN
1       1   0.018205     -0.00917625           NaN           NaN           NaN
2       2   0.00505662   -0.00416604    0.00505662           NaN           NaN
3       2  -8.28544e-05  -0.00424855    0.00497334           NaN           NaN
4       2   0.00127519   -0.00297878    0.00625488           NaN           NaN
5       3  -0.00224315   -0.00521524    0.0039977   -0.00224315           NaN
6       3  -0.0197291    -0.0248414    -0.0158103   -0.021928             NaN
7       3   0.00136592   -0.0235094    -0.0144659   -0.020592             NaN
8       4   0.00582897   -0.0178175    -0.00872129  -0.0148831     0.00582897
9       4   0.00260425   -0.0152597    -0.00613975  -0.0123176     0.0084484
Here is how your code performs on my machine (Python 3.10.7, Pandas 1.4.3), averaged over 10,000 iterations:
import statistics
import time

import numpy as np
import pandas as pd

elapsed_time = []
for _ in range(10_000):
    start_time = time.time()
    periods = set(df["Period"])
    for period in periods:
        cumret = (1 + df.query("Period >= @period")["Returns"]).cumprod() - 1
        df[f"M_{period}"] = cumret
    elapsed_time.append(time.time() - start_time)
print(f"--- {round(statistics.mean(elapsed_time), 6):2} seconds on average ---")
print(df)
Output:
--- 0.00298 seconds on average ---
Period Returns M_1 M_2 M_4
0 1 -0.008427 -0.008427 NaN NaN
1 1 0.019699 0.011106 NaN NaN
2 2 0.012661 0.023908 0.012661 NaN
3 2 -0.005059 0.018728 0.007538 NaN
4 4 0.025452 0.044657 0.033182 0.025452
5 4 0.010808 0.055948 0.044349 0.036535
6 4 0.004843 0.061062 0.049407 0.041555
7 4 0.005791 0.067207 0.055484 0.047587
8 4 -0.001816 0.065269 0.053568 0.045685
9 4 0.014102 0.080291 0.068425 0.060431
With some minor modifications, you can get a ~3x speed improvement:
elapsed_time = []
for _ in range(10_000):
    start_time = time.time()
    for period in df["Period"].unique():
        df[f"M_{period}"] = (
            1 + df.loc[df["Period"].ge(period), "Returns"]
        ).cumprod() - 1
    elapsed_time.append(time.time() - start_time)
print(f"--- {round(statistics.mean(elapsed_time), 6):2} seconds on average ---")
print(df)
Output:
--- 0.001052 seconds on average ---
Period Returns M_1 M_2 M_4
0 1 -0.008427 -0.008427 NaN NaN
1 1 0.019699 0.011106 NaN NaN
2 2 0.012661 0.023908 0.012661 NaN
3 2 -0.005059 0.018728 0.007538 NaN
4 4 0.025452 0.044657 0.033182 0.025452
5 4 0.010808 0.055948 0.044349 0.036535
6 4 0.004843 0.061062 0.049407 0.041555
7 4 0.005791 0.067207 0.055484 0.047587
8 4 -0.001816 0.065269 0.053568 0.045685
9 4 0.014102 0.080291 0.068425 0.060431
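If the per-period loop itself becomes the bottleneck, a further option (a sketch, not part of the original answer) is to replace the repeated cumprod calls with a single cumulative log-sum; each M_p column is then just an offset of that one series. This assumes df["Period"] is sorted ascending, as in the sample data, and that all returns are greater than -1; the results agree with cumprod up to floating-point rounding.
import numpy as np

# One pass: cumulative log of the growth factors (1 + Returns).
logc = np.log1p(df["Returns"]).cumsum()

# The first row of each period is where that period's cumprod restarts.
first_rows = df.groupby("Period").head(1)
for i, p in zip(first_rows.index, first_rows["Period"]):
    base = logc.iloc[i - 1] if i > 0 else 0.0
    # exp(logc - base) - 1 equals the cumprod of (1 + Returns) from row i onward, minus 1.
    df[f"M_{p}"] = np.expm1(logc - base).where(df.index >= i)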
I'm having problems with the pd.rolling() method: it returns several outputs even though my function returns a single value.
My objective is to:
Calculate the absolute percentage difference between two DataFrames with 3 columns each.
Sum all the values.
I can do this using iterrows(), but that approach becomes too slow on larger datasets.
This is the test data I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
          'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
          'column3': [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
This method produces the output I want using iterrows():
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs(((df2 / df1.iloc[index - 3 + 1:index + 1].reset_index(drop=True).values) - 1) * 100)
        Average = Div.sum(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[991.2698412698413,
636.2698412698412,
456.19047619047626,
616.6666666666667,
935.7142857142858,
627.3809523809524,
592.8571428571429,
350.8333333333333,
449.1666666666667,
1290.0,
658.531746031746,
646.031746031746,
597.4603174603175,
478.80952380952385,
383.0952380952381,
980.5555555555555,
612.5]
Finally, below is my attempt to use pd.rolling() so that I don't need to loop through each row.
def SumOfAverageFunction(vals):
    Div = abs(((df2.values / vals.reset_index(drop=True).values) - 1) * 100)
    Average = Div.sum()
    SumOfAverages = np.sum(Average)
    return SumOfAverages

RunningSums = df1.rolling(window=3, axis=0).apply(SumOfAverageFunction)
Here is my problem: printing RunningSums (below) gives several columns of values, nothing close to the results from the iterrows method. How do I solve this?
print(RunningSums)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 702.380952 780.000000 283.333333
3 533.333333 640.000000 533.333333
4 1200.000000 475.000000 403.174603
5 833.333333 1280.000000 625.396825
6 563.333333 760.000000 1385.714286
7 346.666667 386.666667 1016.666667
8 473.333333 573.333333 447.619048
9 533.333333 1213.333333 327.619048
10 375.000000 746.666667 415.714286
11 408.333333 453.333333 515.000000
12 604.166667 338.333333 1250.000000
13 1366.666667 577.500000 775.000000
14 847.619048 1400.000000 683.333333
15 314.285714 733.333333 455.555556
16 533.333333 441.666667 474.444444
17 347.619048 616.666667 546.666667
18 735.714286 466.666667 1290.000000
19 350.000000 488.888889 875.000000
20 525.000000 1361.111111 1266.666667
That's just the way rolling behaves: it windows over each column separately, and I don't know of a way around that. One solution is to apply rolling to a single column and use the indexes from those windows to slice the dataframe inside your function. Still expensive, but probably not as bad as what you're doing.
Also, the output of your first method looks wrong: you're actually starting your calculations a few rows too late.
import numpy as np

def SumOfAverageFunction(vals):
    # vals is a window of column1; its index labels select the matching rows of df1.
    return (abs(np.divide(df2.values, df1.loc[vals.index].values) - 1) * 100).sum()

vals = df1.column1.rolling(3)
vals.apply(SumOfAverageFunction, raw=False)
So I have a dataframe:
import numpy
import pandas
df = pandas.DataFrame([[numpy.nan, 5], [numpy.nan, 5], [2015, 5], [2020, 5], [numpy.nan, 10], [numpy.nan, 10], [numpy.nan, 10], [2090, 10], [2100, 10]], columns=["value", "interval"])
value interval
0 NaN 5
1 NaN 5
2 2015.0 5
3 2020.0 5
4 NaN 10
5 NaN 10
6 NaN 10
7 2090.0 10
8 2100.0 10
I need to backward-fill the NaN values based on their interval and the first non-NaN value following that index, so the expected output is:
value interval
0 2005.0 5 # corrected 2010 - 5(interval)
1 2010.0 5 # corrected 2015 - 5(interval)
2 2015.0 5 # no change ( use this to correct 2 previous rows)
3 2020.0 5 # no change
4 2060.0 10 # corrected 2070 - 10
5 2070.0 10 # corrected 2080 - 10
6 2080.0 10 # corrected 2090 - 10
7 2090.0 10 # no change (use this to correct 3 previous rows)
8 2100.0 10 # no change
I am at a loss as to how I can accomplish this task using pandas/numpy vectorized operations...
I can do it with a pretty simple loop:
last_good_value = None
fixed_values = []
for val, interval in reversed(df.values):
    if numpy.isnan(val) and last_good_value is not None:
        fixed_values.append(last_good_value - interval)
        last_good_value = fixed_values[-1]
    else:
        fixed_values.append(val)
        if not numpy.isnan(val):
            last_good_value = val
print(fixed_values[::-1])
which, strictly speaking, works... but I would like to understand a pandas solution that can resolve the values and avoid the loop (this is quite a big list in reality).
First, get the position of each row within its group of rows sharing the same 'interval' value.
Then, get the last value of each group.
What you are looking for is last_value - pos * interval. (Note this groups by the interval value itself, so it assumes each distinct interval appears in one contiguous block, as in the example.)
df = df.reset_index()
grouped_df = df.groupby(['interval'])
df['pos'] = grouped_df['index'].rank(method='first', ascending=False) - 1
df['last'] = grouped_df['value'].transform('last')
df['value'] = df['last'] - df['interval'] * df['pos']
del df['pos'], df['last'], df['index']
Create a grouping Series that ties the last non-null value to all NaN rows before it, by reversing with [::-1]. Then you can bfill within each group and use cumsum to determine how much to subtract from every row.
s = df['value'].notnull()[::-1].cumsum()
subt = df.loc[df['value'].isnull(), 'interval'][::-1].groupby(s).cumsum()
df['value'] = df.groupby(s)['value'].bfill().subtract(subt, fill_value=0)
value interval
0 2005.0 5
1 2010.0 5
2 2015.0 5
3 2020.0 5
4 2060.0 10
5 2070.0 10
6 2080.0 10
7 2090.0 10
8 2100.0 10
Because subt contains only the NaN rows, fill_value=0 ensures that rows which already have values remain unchanged:
print(subt)
#6 10
#5 20
#4 30
#1 5
#0 10
#Name: interval, dtype: int64
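For illustration (this print is not in the original answer, but was computed from the sample frame above), sorting s back into row order shows how each NaN run shares a group label with the non-null row that follows it:
print(s.sort_index())
#0    4
#1    4
#2    4
#3    3
#4    2
#5    2
#6    2
#7    2
#8    1
#Name: value, dtype: int64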
This is my desired output (shown as an image in the original post):
I am trying to calculate the columns df['Value'] and df['Value_Compensed']. However, to do that, I need to consider the previous row's value of df['Value_Compensed']. In terms of my table:
In the first row, all the values are 0.
In the following rows: df['Remained'] = previous df['Value_Compensed']. Then df['Value'] = df['Initial_value'] + df['Remained']. Then df['Value_Compensed'] = df['Value'] - df['Compensation'].
...and so on...
I am struggling to pass the value of Value_Compensed from one row to the next. I tried the shift() function, but (as the image in the original post showed) the values in df['Value_Compensed'] come out wrong: the value I need is not static, it changes as each row is computed, so shift() did not work. Any idea?
Thanks.
Manuel.
You can use apply to create your customised operations. I've made a dummy dataset since you didn't provide the initial dataframe.
from itertools import zip_longest

import numpy as np
import pandas as pd

# dummy data
df = pd.DataFrame(np.random.randint(1, 10, (8, 5)),
                  columns=['compensation', 'initial_value',
                           'remained', 'value', 'value_compensed'])
df.loc[0] = 0, 0, 0, 0, 0
>>> print(df)
compensation initial_value remained value value_compensed
0 0 0 0 0 0
1 2 9 1 9 7
2 1 4 9 8 3
3 3 4 5 7 6
4 3 2 5 5 6
5 9 1 5 2 4
6 4 5 9 8 2
7 1 6 9 6 8
Use apply with axis=1 to do row-wise iteration, passing the initial dataframe as an argument; from it you can fetch the previous row via x.name - 1 and do your calculations. I'm not sure I fully understood the intended result, but you can adjust the individual column calculations inside the function.
def f(x, data):
    if x.name == 0:
        return [0] * data.shape[1]
    else:
        x_remained = data.loc[x.name - 1]['value_compensed']
        x_value = data.loc[x.name - 1]['initial_value'] + x_remained
        x_compensed = x_value - x['compensation']
        return [x['compensation'], x['initial_value'], x_remained,
                x_value, x_compensed]

adj = df.apply(f, args=(df,), axis=1)
adj = pd.DataFrame.from_records(zip_longest(*adj.values), index=df.columns).T
>>> print(adj)
compensation initial_value remained value value_compensed
0 0 0 0 0 0
1 5 9 0 0 -5
2 5 7 4 13 8
3 7 9 1 8 1
4 6 6 5 14 8
5 4 9 6 12 8
6 2 4 2 11 9
7 9 2 6 10 1
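If each row's Remained really must come from the previously computed Value_Compensed (rather than from the original dummy values, as the apply above reads them), the dependence is genuinely sequential, and a plain loop that carries the running value forward is the straightforward option. A minimal sketch, using a hypothetical input frame that holds only the two given columns:
import pandas as pd

# Hypothetical input: only initial_value and compensation are given.
df = pd.DataFrame({'initial_value': [0, 9, 4, 4, 2, 1, 5, 6],
                   'compensation': [0, 2, 1, 3, 3, 9, 4, 1]})

remained, value, compensed = [0], [0], [0]
for i in range(1, len(df)):
    remained.append(compensed[-1])                        # previous Value_Compensed
    value.append(df['initial_value'].iloc[i] + remained[-1])
    compensed.append(value[-1] - df['compensation'].iloc[i])

df['remained'], df['value'], df['value_compensed'] = remained, value, compensed
print(df)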
I have the following code in Python:
import numpy as np
import pandas as pd
colum1 = [1,2,3,4,5,6,7,8,9,10,11,12]
colum2 = [10,20,30,40,50,60,70,80,90,100,110,120]
df = pd.DataFrame({
    'colum1': colum1,
    'colum2': colum2
})
df.loc[df.colum1 == 1,'result'] = df['colum2']
for i in range(len(colum2)):
    df.result = np.where(df.colum1 > 1, 5 - (df['colum2'] - df.result.shift(1)), df.result)
The resulting df is:
colum1 colum2 result
0 1 10 10.0
1 2 20 -5.0
2 3 30 -30.0
3 4 40 -65.0
4 5 50 -110.0
5 6 60 -165.0
6 7 70 -230.0
7 8 80 -305.0
8 9 90 -390.0
9 10 100 -485.0
10 11 110 -590.0
11 12 120 -705.0
I would like to know if there is a method that allows me to obtain the same result without using a for loop.
Your operation depends on two things: the previous row in the DataFrame, and the difference between consecutive values in the DataFrame. That hints that the solution will require shift and diff. However, you also add a small constant (the 5) on each step of the expanding sum, and that sum is subtracted from each row rather than added.
To set up the pieces of the problem, first create your shifted series, adding the 5:
a = df.colum2.shift().add(5).cumsum().fillna(0)
Now you need the difference between elements in the Series, and fill missing results with their respective value in colum2:
b = df.colum2.diff().fillna(df.colum2)
To get your final result, simply subtract a from b:
b - a
0 10.0
1 -5.0
2 -30.0
3 -65.0
4 -110.0
5 -165.0
6 -230.0
7 -305.0
8 -390.0
9 -485.0
10 -590.0
11 -705.0
Name: colum2, dtype: float64
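If the question's loop and the two snippets above are run in the same session, a quick sanity check (not in the original answer) confirms that the vectorized result reproduces the loop-built column:
import numpy as np

# b - a matches df['result'] from the original for-loop version.
assert np.allclose(b - a, df['result'])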