Pandas dot product with Multiindex - python

My problem is quite common in finance.
Given an array w (1xN) of weights and a covariance matrix Q (NxN) of assets, one can calculate the variance of the portfolio using the quadratic expression w' * Q * w, where * is the dot product.
I want to understand the best way to perform this operation when I have a history of weights W (T x N) and a 3D structure of covariance matrices (T, N, N).
import numpy as np
import pandas as pd
returns = pd.DataFrame(0.1 * np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
covariance = returns.rolling(20).cov()
weights = pd.DataFrame(np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
My solution so far has been to convert the pandas DataFrames to numpy arrays, perform the calculation in a loop, and then convert back to pandas.
Note that I need to explicitly check for the alignment of labels, since in reality covariance and weights could be calculated by different processes.
cov_dict = {key: covariance.xs(key, axis=0, level=0) for key in covariance.index.get_level_values(0)}
def naive_numpy(weights, cov_dict):
    expected_risk = {}
    # Extract columns, index before passing to numpy arrays
    # Columns
    cov_assets = cov_dict[next(iter(cov_dict))].columns
    avail_assets = [el for el in cov_assets if el in weights]
    # Indexes
    cov_dates = list(cov_dict.keys())
    avail_dates = weights.index.intersection(cov_dates)
    sel_weights = weights.loc[avail_dates, avail_assets]
    # Main loop and calculation
    for t, value in zip(sel_weights.index, sel_weights.values):
        expected_risk[t] = np.sqrt(np.dot(value, np.dot(cov_dict[t].values, value)))
    # Back to pandas DataFrame
    expected_risk = pd.Series(expected_risk).reindex(weights.index).sort_index()
    return expected_risk
Is there a pure-pandas way to achieve the same result? Or is there any improvement to the code to make it more efficient? (Despite using numpy, it is still quite slow.)

I think numpy is definitely the best option, though you lose that efficiency if you loop over values/dates.
My suggestion for calculating the rolling volatility of a portfolio (with no looping):
returns = pd.DataFrame(0.1 * np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
covariance = returns.rolling(20).cov()
weights = pd.DataFrame(np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
rows, columns = weights.shape
# Go to numpy:
w = weights.values
cov = covariance.values.reshape(rows, columns, columns)
A = np.matmul(w.reshape(rows, 1, columns), cov)
var = np.matmul(A, w.reshape(rows, columns, 1)).reshape(rows)
std_dev = np.sqrt(var)
# Back to pandas (in case you want that):
pd.Series(std_dev, index = weights.index)
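If the weights and covariances really do come from different processes, you can do the label alignment in pandas first and then collapse the whole quadratic form into a single np.einsum call. This is only a sketch on the synthetic frames above; the intersection-based alignment is my assumption about what "checking labels" should mean here:
import numpy as np
import pandas as pd
returns = pd.DataFrame(0.1 * np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
covariance = returns.rolling(20).cov()
weights = pd.DataFrame(np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
# Align dates and assets while still in pandas
assets = covariance.columns.intersection(weights.columns)
dates = weights.index.intersection(covariance.index.get_level_values(0).unique())
w = weights.loc[dates, assets].to_numpy()                                   # (T, N)
q = (covariance.reindex(index=pd.MultiIndex.from_product([dates, assets]),
                        columns=assets)
               .to_numpy()
               .reshape(len(dates), len(assets), len(assets)))              # (T, N, N)
# einsum evaluates w_t' Q_t w_t for every date t in one call
risk = pd.Series(np.sqrt(np.einsum('ti,tij,tj->t', w, q, w)), index=dates)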

Related

How do I apply Sympy.Solve to solve a two-sided function on each row in a database?

I am attempting to apply the sympy.solve function to each row in a DataFrame, using each column as a different variable. I have managed to solve for the undefined variable I need to calculate -- λ -- using the following code:
from sympy import symbols, Eq, solve
import math
Tmin = 33.2067
Tmax = 42.606
D = 19.5526
tmin = 6
tmax = 14
pi = math.pi
λ = symbols('λ')
lhs = D
rhs = (((Tmax-Tmin)/2) * (λ-tmin) * (1-(2/pi))) + (((Tmax-Tmin)/2)*(tmax-λ)*(1+(2/pi))) - (((Tmax-Tmin)/2)*(tmax-tmin))
eq1 = Eq(lhs, rhs)
lam = solve(eq1)
print(lam)
However, I need to apply this function to every row in a DataFrame and output the result as its own column. The DataFrame is formatted as follows:
import pandas as pd
data = [[6, 14, 33.2067, 42.606, 19.5526], [6, 14, 33.4885, 43.0318, -27.9222]]
df = pd.DataFrame(data, columns=['tmin', 'tmax', 'Tmin', 'Tmax', 'D'])
I have searched for how to do this, but am not sure how to proceed. I managed to find similar questions wherein the answers discussed lambdifying the equation, but I wasn't sure how to lambdify a two-sided equation and then apply it to my DataFrame, and my math skills aren't strong enough to isolate λ and place it on the left side of the equation so that I don't have to lambdify a two-sided equation. Any help here would be appreciated.
You can use the apply method of a dataframe in order to apply a numerical function to each row. But first we need to create the numerical function: we are going to do that with lambdify:
from sympy import symbols, Eq, solve, pi, lambdify
import pandas as pd
# create the necessary symbols
Tmin, Tmax, D, tmin, tmax = symbols("T_min, T_max, D, t_min, t_max")
λ = symbols('λ')
# create the equation and solve for λ
lhs = D
rhs = (((Tmax-Tmin)/2) * (λ-tmin) * (1-(2/pi))) + (((Tmax-Tmin)/2)*(tmax-λ)*(1+(2/pi))) - (((Tmax-Tmin)/2)*(tmax-tmin))
eq1 = Eq(lhs, rhs)
# solve eq1 for λ: take the first (and only) solution
λ_expr = solve(eq1, λ)[0]
print(λ_expr)
# out: (-pi*D + T_max*t_max + T_max*t_min - T_min*t_max - T_min*t_min)/(2*(T_max - T_min))
# convert λ_expr to a numerical function so that it can
# be quickly evaluated. Essentially, it creates:
# λ_func(tmin, tmax, Tmin, Tmax, D)
# NOTE: for simplicity, let's order the symbols like the
# columns of the dataframe
λ_func = lambdify([tmin, tmax, Tmin, Tmax, D], λ_expr)
# We are going to use df.apply to apply the function to each row
# of the dataframe. However, pandas will pass in the current row
# as the argument. For example:
# row = [val_tmin, val_tmax, val_Tmin, val_Tmax, val_D]
# Hence, we need a wrapper function to unpack the row to the
# arguments required by λ_func
wrapper_func = lambda row: λ_func(*row)
# create the dataframe
data = [[6, 14, 33.2067, 42.606, 19.5526], [6, 14, 33.4885, 43.0318, -27.9222]]
df = pd.DataFrame(data, columns=['tmin', 'tmax', 'Tmin', 'Tmax', 'D'])
# apply the function to each row
print(df.apply(wrapper_func, axis=1))
# 0 6.732400044759731
# 1 14.595903848357743
# dtype: float64
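As a side note, the function produced by lambdify is numpy-backed, so it also accepts whole columns at once; if performance matters you could skip apply entirely. A minimal sketch, assuming the df and λ_func defined above:
# Vectorized evaluation over whole columns (no row-wise apply)
df['lambda'] = λ_func(df['tmin'].values, df['tmax'].values,
                      df['Tmin'].values, df['Tmax'].values, df['D'].values)
print(df['lambda'])
# 0     6.732400
# 1    14.595904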

Need to use apply or broadcasting and masking to iterate over a DataFrame

I have a data frame that I need to iterate over. I want to use either apply or broadcasting and masking. This is the pseudocode I am trying to improve upon.
Algorithm 1: The algorithm
initialize the population (of size n) uniformly randomly, obeying the bounds;
while a pre-determined number of iterations is not complete do
    set the random parameters (two independent parameters for each of the d variables);
    find the best and the worst vectors in the population;
    for each vector in the population do
        create a new vector using the current vector, the best vector,
        the worst vector, and the random parameters;
        if the new vector is at least as good as the current vector then
            current vector = new vector;
This is the code I have so far.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.uniform(-5.0, 10.0, size = (20, 5)), columns = list('ABCDE'))
pd.set_option('display.max_columns', 500)
df
#while portion of pseudocode
f_func = np.square(df).sum(axis=1)
final_func = np.square(f_func)
xti_best = final_func.idxmin()
xti_worst = final_func.idxmax()
print(final_func)
print(df.head())
print(df.tail())
#for loop of pseudocode
#for row in df.iterrows():
#implement equation from assignment
#define in array math
#xi_new = row.to_numpy() + np.random.uniform(0, 1, size = (1, 5)) * (df.iloc[xti_best].values - np.absolute(row.to_numpy())) - np.random.uniform(0, 1, size = (1, 5)) * (df.iloc[xti_worst].values - np.absolute(row.to_numpy()))
#print(xi_new)
df2 = df.apply(lambda row: 0 if row == 0 else row + np.random.uniform(0, 1, size = (1, 5)) * (df.iloc[xti_best].values - np.absolute(axis = 1)))
print(df2)
The formula I am trying to use for xi_new is:
xi_new = xi_current + rand(0, 1) * (xti_best - abs(xi_current)) - rand(0, 1) * (xti_worst - abs(xi_current))
I'm not sure I'm implementing your formula correctly, but hopefully this helps
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.uniform(-5.0, 10.0, size = (20, 5)), columns = list('ABCDE'))
#while portion of pseudocode
f_func = np.square(df).sum(axis=1)
final_func = np.square(f_func)
xti_best_idx = final_func.idxmin()
xti_worst_idx = final_func.idxmax()
xti_best = df.loc[xti_best_idx]
xti_worst = df.loc[xti_worst_idx]
#Calculate random values for the whole df for the two different areas where you need randomness
nrows,ncols = df.shape
r1 = np.random.uniform(0, 1, size = (nrows, ncols))
r2 = np.random.uniform(0, 1, size = (nrows, ncols))
#xi_new = xi_current + random value between 0,1(xti_best -abs(xi_current)) - random value(xti_worst - abs(xi_current))
df = df + r1*xti_best.sub(df.abs()) - r2*xti_worst.sub(df.abs())
df
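The pseudocode also has an acceptance step (keep the new vector only if it is at least as good as the current one). If you need that, a rough sketch is to build the candidate frame separately instead of overwriting df in the last line, then copy back only the improved rows; the fitness below reuses the sum-of-squares objective from the question:
# Candidate vectors (same update rule as above, kept separate from df)
new_df = df + r1*xti_best.sub(df.abs()) - r2*xti_worst.sub(df.abs())
# Fitness: sum of squares per row, lower is better (as in the question)
current_fitness = np.square(df).sum(axis=1)
new_fitness = np.square(new_df).sum(axis=1)
# Accept only the rows that are at least as good
improved = new_fitness <= current_fitness
df.loc[improved] = new_df.loc[improved]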

How to stack multiple features of different shapes in NumPy?

What is a better way to do the following code block? I want to create a 1d array for each scene containing features a-e, to eventually have the shape m x n, where m is the number of scenes and n is the combined length of all the features.
The shape of features a-d is unknown and can be different from each other. For example feature a could have shape 100 x 3 x 3 x 5 while feature b could have shape 30 x 4. Feature e is simply a boolean.
inputs = []
for scene in scenes:
    inp = np.concatenate((
        scene['a'].flatten(),
        scene['b'].flatten(),
        scene['c'].flatten(),
        scene['d'].flatten(),
        [scene['e'] == True]))
    inputs.append(inp)
inputs = torch.FloatTensor(inputs)
Let's say we know ['a', 'b', 'c', 'd', 'e'] are the only attributes in each scene (so they're accessible via scene.keys()). Then the following code works:
output = np.vstack([
    np.hstack([np.asarray(x).flatten() for x in s.values()]) for s in scenes
])
inputs = torch.FloatTensor(output)
In order to test that, I created a scene generator function that creates dictionaries similar to what you described and synthetically creates 10 scenes:
import numpy as np
def scene_generator():
    scene = dict()
    scene['a'] = np.random.random((100, 3, 3, 5))
    scene['b'] = np.random.random((30, 4))
    scene['c'] = np.random.random((15, 15, 2))
    scene['d'] = np.random.random((2, 2, 2))
    scene['e'] = True
    return scene
scenes = [scene_generator() for _ in range(10)]
output = np.vstack([
    np.hstack([np.asarray(x).flatten() for x in s.values()]) for s in scenes
])
print(output.shape)
# (10, 5079)

pd.get_dummies() slow on large levels

I'm unsure if this is already the fastest possible method, or if I'm doing this inefficiently.
I want to one-hot encode a particular categorical column which has 27k+ possible levels. The column has different values in the 2 datasets, so I combined the levels first before using get_dummies()
def hot_encode_column_in_both_datasets(column_name, df, df2, sparse=True):
    col1b = set(df2[column_name].unique())
    col1a = set(df[column_name].unique())
    combined_cats = list(col1a.union(col1b))
    df[column_name] = df[column_name].astype('category', categories=combined_cats)
    df2[column_name] = df2[column_name].astype('category', categories=combined_cats)
    df = pd.get_dummies(df, columns=[column_name], sparse=sparse)
    df2 = pd.get_dummies(df2, columns=[column_name], sparse=sparse)
    try:
        del df[column_name]
        del df2[column_name]
    except:
        pass
    return df, df2
However, it's been running for more than 2 hours and is still stuck hot encoding.
Could I be doing something wrongly here? Or is it just the nature of running it on large datasets?
Df has 6.8m rows and 27 columns, Df2 has 19990 rows and 27 columns before hot encoding the column that I wanted to.
Advice appreciated, thank you! :)
I reviewed the get_dummies source code briefly, and I think it may not be taking full advantage of the sparsity for your use case. The following approach may be faster, but I did not attempt to scale it all the way up to the 19M records you have:
import numpy as np
import pandas as pd
import scipy.sparse as ssp
np.random.seed(1)
N = 10000
dfa = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N)
    , 'col2b': np.random.choice([1, 2, 3], N)
    , 'target': np.random.choice([1, 2, 3], N)
})
# construct an array of the unique values of the column to be encoded
vals = np.array(dfa.col1.unique())
# extract an array of values to be encoded from the dataframe
col1 = dfa.col1.values
# construct a sparse matrix of the appropriate size and an appropriate,
# memory-efficient dtype
spmtx = ssp.dok_matrix((N, len(vals)), dtype=np.uint8)
# do the encoding. NB: This is only vectorized in one of the two dimensions.
# Finding a way to vectorize the second dimension may yield a large speed up
for idx, val in enumerate(vals):
    spmtx[np.argwhere(col1 == val), idx] = 1
# Construct a SparseDataFrame from the sparse matrix and apply the index
# from the original dataframe and column names.
dfnew = pd.SparseDataFrame(spmtx, index=dfa.index,
                           columns=['col1_' + str(el) for el in vals])
dfnew.fillna(0, inplace=True)
UPDATE
Borrowing insights from other answers here and here, I was able to vectorize the solution in both dimensions. In my limited testing, I noted that constructing the SparseDataFrame seems to increase the execution time several fold. So, if you don't need to return a DataFrame-like object, you can save a lot of time. This solution also handles the case where you need to encode 2+ DataFrames into 2-d arrays with equal numbers of columns.
import numpy as np
import pandas as pd
import scipy.sparse as ssp
np.random.seed(1)
N1 = 10000
N2 = 100000
dfa = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N1)
    , 'col2a': np.random.choice([1, 2, 3], N1)
    , 'target': np.random.choice([1, 2, 3], N1)
})
dfb = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N2)
    , 'col2b': np.random.choice(['foo', 'bar', 'baz'], N2)
    , 'target': np.random.choice([1, 2, 3], N2)
})
# construct an array of the unique values of the column to be encoded
# taking the union of the values from both dataframes.
valsa = set(dfa.col1.unique())
valsb = set(dfb.col1.unique())
vals = np.array(list(valsa.union(valsb)), dtype=np.uint16)
def sparse_ohe(df, col, vals):
    """One-hot encoder using a sparse ndarray."""
    colaray = df[col].values
    # construct a sparse matrix of the appropriate size and an appropriate,
    # memory-efficient dtype
    spmtx = ssp.dok_matrix((df.shape[0], vals.shape[0]), dtype=np.uint8)
    # do the encoding
    spmtx[np.where(colaray.reshape(-1, 1) == vals.reshape(1, -1))] = 1
    # Construct a SparseDataFrame from the sparse matrix
    dfnew = pd.SparseDataFrame(spmtx, dtype=np.uint8, index=df.index,
                               columns=[col + '_' + str(el) for el in vals])
    dfnew.fillna(0, inplace=True)
    return dfnew
dfanew = sparse_ohe(dfa, 'col1', vals)
dfbnew = sparse_ohe(dfb, 'col1', vals)
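Note that pd.SparseDataFrame was removed in pandas 1.0, so on a current pandas the same idea needs a different wrapper. A hedged sketch (the sparse_ohe_modern name is hypothetical; it reuses np, pd, ssp and the vals array from above) using pd.DataFrame.sparse.from_spmatrix:
def sparse_ohe_modern(df, col, vals):
    """One-hot encoder returning a sparse-backed DataFrame on pandas >= 1.0."""
    colaray = df[col].values
    spmtx = ssp.dok_matrix((df.shape[0], vals.shape[0]), dtype=np.uint8)
    spmtx[np.where(colaray.reshape(-1, 1) == vals.reshape(1, -1))] = 1
    # from_spmatrix replaces the removed SparseDataFrame constructor
    return pd.DataFrame.sparse.from_spmatrix(
        spmtx.tocsr(), index=df.index,
        columns=[col + '_' + str(el) for el in vals])
dfanew = sparse_ohe_modern(dfa, 'col1', vals)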

Efficient Python Pandas Stock Beta Calculation on Many Dataframes

I have many (4000+) CSVs of stock data (Date, Open, High, Low, Close) which I import into individual pandas DataFrames to perform analysis. I am new to Python and want to calculate a rolling 12-month beta for each stock. I found a post on calculating rolling beta (Python pandas calculate rolling stock beta using rolling apply to groupby object in vectorized fashion); however, when used in my code below it takes over 2.5 hours! Considering I can run the exact same calculations in SQL tables in under 3 minutes, this is too slow.
How can I improve the performance of my code below to match that of SQL? I understand that pandas/Python has that capability. My current method loops over each row, which I know slows performance, but I am unaware of any aggregate way to perform a rolling-window beta calculation on a DataFrame.
Note: the first 2 steps of loading the CSVs into individual DataFrames and calculating daily returns only take ~20 seconds. All my CSV DataFrames are stored in the dictionary called 'FilesLoaded' with names such as 'XAO'.
Your help would be much appreciated!
Thank you :)
import pandas as pd, numpy as np
import datetime
import ntpath
pd.set_option('precision',10) #Set the Decimal Point precision to DISPLAY
start_time=datetime.datetime.now()
MarketIndex = 'XAO'
period = 250
MinBetaPeriod = period
# ***********************************************************************************************
# CALC RETURNS
# ***********************************************************************************************
for File in FilesLoaded:
    FilesLoaded[File]['Return'] = FilesLoaded[File]['Close'].pct_change()
# ***********************************************************************************************
# CALC BETA
# ***********************************************************************************************
def calc_beta(df):
    np_array = df.values
    m = np_array[:, 0]  # market returns are column zero from numpy array
    s = np_array[:, 1]  # stock returns are column one from numpy array
    covariance = np.cov(s, m)  # Calculate covariance between stock and market
    beta = covariance[0, 1] / covariance[1, 1]
    return beta
#Build Custom "Rolling_Apply" function
def rolling_apply(df, period, func, min_periods=None):
    if min_periods is None:
        min_periods = period
    result = pd.Series(np.nan, index=df.index)
    for i in range(1, len(df)+1):
        sub_df = df.iloc[max(i-period, 0):i, :]
        if len(sub_df) >= min_periods:
            idx = sub_df.index[-1]
            result[idx] = func(sub_df)
    return result
#Create empty BETA dataframe with same index as RETURNS dataframe
df_join = pd.DataFrame(index=FilesLoaded[MarketIndex].index)
df_join['market'] = FilesLoaded[MarketIndex]['Return']
df_join['stock'] = np.nan
for File in FilesLoaded:
    df_join['stock'].update(FilesLoaded[File]['Return'])
    df_join = df_join.replace(np.inf, np.nan)   #get rid of infinite values "inf" (SQL won't take "Inf")
    df_join = df_join.replace(-np.inf, np.nan)  #get rid of infinite values "inf" (SQL won't take "Inf")
    df_join = df_join.fillna(0)                 #get rid of the NaNs in the return data
    FilesLoaded[File]['Beta'] = rolling_apply(df_join[['market', 'stock']], period, calc_beta, min_periods=MinBetaPeriod)
# ***********************************************************************************************
# CLEAN-UP
# ***********************************************************************************************
print('Run-time: {0}'.format(datetime.datetime.now() - start_time))
Generate Random Stock Data
20 Years of Monthly Data for 4,000 Stocks
dates = pd.date_range('1995-12-31', periods=480, freq='M', name='Date')
stoks = pd.Index(['s{:04d}'.format(i) for i in range(4000)])
df = pd.DataFrame(np.random.rand(480, 4000), dates, stoks)
df.iloc[:5, :5]
Roll Function
Returns groupby object ready to apply custom functions
See Source
def roll(df, w):
    # stack df.values w-times shifted once at each stack
    roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
    # roll_array is now a 3-D array and can be read into
    # a pandas panel object
    panel = pd.Panel(roll_array,
                     items=df.index[w-1:],
                     major_axis=df.columns,
                     minor_axis=pd.Index(range(w), name='roll'))
    # convert to dataframe and pivot + groupby
    # is now ready for any action normally performed
    # on a groupby object
    return panel.to_frame().unstack().T.groupby(level=0)
Beta Function
Use closed form solution of OLS regression
Assume column 0 is market
See Source
def beta(df):
    # first column is the market
    X = df.values[:, [0]]
    # prepend a column of ones for the intercept
    X = np.concatenate([np.ones_like(X), X], axis=1)
    # matrix algebra
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values[:, 1:])
    return pd.Series(b[1], df.columns[1:], name='Beta')
Demonstration
rdf = roll(df, 12)
betas = rdf.apply(beta)
Timing
Validation
Compare calculations with OP
def calc_beta(df):
    np_array = df.values
    m = np_array[:, 0]  # market returns are column zero from numpy array
    s = np_array[:, 1]  # stock returns are column one from numpy array
    covariance = np.cov(s, m)  # Calculate covariance between stock and market
    beta = covariance[0, 1] / covariance[1, 1]
    return beta
print(calc_beta(df.iloc[:12, :2]))
-0.311757542437
print(beta(df.iloc[:12, :2]))
s0001 -0.311758
Name: Beta, dtype: float64
Note the first cell is the same value as the validated calculation above:
betas = rdf.apply(beta)
betas.iloc[:5, :5]
Response to comment
Full working example with simulated multiple dataframes
num_sec_dfs = 4000
cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(480, 4), dates, cols) for i in range(num_sec_dfs)}
market = pd.Series(np.random.rand(480), dates, name='Market')
df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(1)
betas = roll(df.pct_change().dropna(), 12).apply(beta)
for c, col in betas.iteritems():
    dfs[c]['Beta'] = col
dfs['s0001'].head(20)
Using a generator to improve memory efficiency
Simulated data
m, n = 480, 10000
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
stocks = pd.Index(['s{:04d}'.format(i) for i in range(n)])
df = pd.DataFrame(np.random.rand(m, n), dates, stocks)
market = pd.Series(np.random.rand(m), dates, name='Market')
df = pd.concat([df, market], axis=1)
Beta Calculation
def beta(df, market=None):
    # If the market values are not passed,
    # I'll assume they are located in a column
    # named 'Market'. If not, this will fail.
    if market is None:
        market = df['Market']
        df = df.drop('Market', axis=1)
    X = market.values.reshape(-1, 1)
    X = np.concatenate([np.ones_like(X), X], axis=1)
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values)
    return pd.Series(b[1], df.columns, name=df.index[-1])
roll function
This returns a generator and will be far more memory efficient
def roll(df, w):
    for i in range(df.shape[0] - w + 1):
        yield pd.DataFrame(df.values[i:i+w, :], df.index[i:i+w], df.columns)
Putting it all together
betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T
Validation
OP beta calc
def calc_beta(df):
    np_array = df.values
    m = np_array[:, 0]  # market returns are column zero from numpy array
    s = np_array[:, 1]  # stock returns are column one from numpy array
    covariance = np.cov(s, m)  # Calculate covariance between stock and market
    beta = covariance[0, 1] / covariance[1, 1]
    return beta
Experiment setup
m, n = 12, 2
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(m, 4), dates, cols) for i in range(n)}
market = pd.Series(np.random.rand(m), dates, name='Market')
df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(1)
betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T
for c, col in betas.iteritems():
    dfs[c]['Beta'] = col
dfs['s0000'].head(20)
calc_beta(df[['Market', 's0000']])
0.0020118230147777435
NOTE:
The calculations are the same
While efficient subdivision of the input data set into rolling windows is important to the optimization of the overall calculations, the performance of the beta calculation itself can also be significantly improved.
The following optimizes only the subdivision of the data set into rolling windows:
def numpy_betas(x_name, window, returns_data, intercept=True):
    if intercept:
        ones = numpy.ones(window)

    def lstsq_beta(window_data):
        x_data = numpy.vstack([window_data[x_name], ones]).T if intercept else window_data[[x_name]]
        beta_arr, residuals, rank, s = numpy.linalg.lstsq(x_data, window_data)
        return beta_arr[0]

    indices = [int(x) for x in numpy.arange(0, returns_data.shape[0] - window + 1, 1)]
    return DataFrame(
        data=[lstsq_beta(returns_data.iloc[i:(i + window)]) for i in indices]
        , columns=list(returns_data.columns)
        , index=returns_data.index[window - 1::1]
    )
The following also optimizes the beta calculation itself:
def custom_betas(x_name, window, returns_data):
    window_inv = 1.0 / window
    x_sum = returns_data[x_name].rolling(window, min_periods=window).sum()
    y_sum = returns_data.rolling(window, min_periods=window).sum()
    xy_sum = returns_data.mul(returns_data[x_name], axis=0).rolling(window, min_periods=window).sum()
    xx_sum = numpy.square(returns_data[x_name]).rolling(window, min_periods=window).sum()
    xy_cov = xy_sum - window_inv * y_sum.mul(x_sum, axis=0)
    x_var = xx_sum - window_inv * numpy.square(x_sum)
    betas = xy_cov.divide(x_var, axis=0)[window - 1:]
    betas.columns.name = None
    return betas
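For reference, a hypothetical usage sketch on simulated monthly returns (the column names, including 'Market', are assumptions; note that the market column itself comes back with a beta of 1):
import numpy
import pandas as pd
dates = pd.date_range('1995-12-31', periods=480, freq='M', name='Date')
returns_data = pd.DataFrame(numpy.random.rand(480, 5), index=dates,
                            columns=['Market', 's0001', 's0002', 's0003', 's0004']).pct_change().dropna()
# rolling 12-period betas of every column against 'Market'
betas = custom_betas('Market', 12, returns_data)
print(betas.head())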
Comparing the performance of the two calculations, you can see that as the window used in the beta calculation increases, the second method dramatically outperforms the first.
Comparing the performance to that of piRSquared's implementation, the custom method takes roughly 350 ms to evaluate, compared to over 2 seconds.
Further optimizing piRSquared's implementation for both speed and memory; the code is also simplified for clarity.
from numpy import nan, ndarray, ones_like, vstack, random
from numpy.lib.stride_tricks import as_strided
from numpy.linalg import pinv
from pandas import DataFrame, date_range
def calc_beta(s: ndarray, m: ndarray):
    x = vstack((ones_like(m), m))
    b = pinv(x.dot(x.T)).dot(x).dot(s)
    return b[1]

def rolling_calc_beta(s_df: DataFrame, m_df: DataFrame, period: int):
    result = ndarray(shape=s_df.shape, dtype=float)
    l, w = s_df.shape
    ls, ws = s_df.values.strides
    result[0:period - 1, :] = nan
    s_arr = as_strided(s_df.values, shape=(l - period + 1, period, w), strides=(ls, ls, ws))
    m_arr = as_strided(m_df.values, shape=(l - period + 1, period), strides=(ls, ls))
    for row in range(period, l):
        result[row, :] = calc_beta(s_arr[row - period, :], m_arr[row - period])
    return DataFrame(data=result, index=s_df.index, columns=s_df.columns)
if __name__ == '__main__':
    num_sec_dfs, num_periods = 4000, 480

    dates = date_range('1995-12-31', periods=num_periods, freq='M', name='Date')
    stocks = DataFrame(data=random.rand(num_periods, num_sec_dfs), index=dates,
                       columns=['s{:04d}'.format(i) for i in range(num_sec_dfs)]).pct_change()
    market = DataFrame(data=random.rand(num_periods), index=dates,
                       columns=['Market']).pct_change()
    betas = rolling_calc_beta(stocks, market, 12)
%timeit betas = rolling_calc_beta(stocks, market, 12)
335 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
HERE'S THE SIMPLEST AND FASTEST SOLUTION
The accepted answer was too slow for what I needed, and I didn't understand the math behind the solutions asserted as faster. They also gave different answers, though in fairness I probably just messed it up.
I don't think you need to make a custom rolling function to calculate beta with pandas 1.1.4 (or even since at least 0.19). The code below assumes the data is in the same format as the problems above: a pandas DataFrame with a date index, percent returns of some periodicity for the stocks, and market returns located in a column named 'Market'.
If you don't have this format, I recommend joining the stock returns to the market returns to ensure the same index with:
# Use .pct_change() only if joining Close data
beta_data = stock_data.join(market_data, how='inner').pct_change().dropna()
After that, it's just covariance divided by variance.
ticker_covariance = beta_data.rolling(window).cov()
# Limit results to the stock (i.e. column name for the stock) vs. 'Market' covariance
ticker_covariance = ticker_covariance.loc[pd.IndexSlice[:, stock], 'Market'].dropna()
benchmark_variance = beta_data['Market'].rolling(window).var().dropna()
beta = ticker_covariance / benchmark_variance
NOTES: If you have a multi-index, you'll have to drop the non-date levels to use the rolling().apply() solution. I only tested this for one stock and one market. If you have multiple stocks, a modification to the ticker_covariance equation after .loc is probably needed. Last, if you want to calculate beta values for the periods before the full window (e.g. stock_data begins 1 year ago, but you use 3 years of data), then you can modify the above to an expanding (instead of rolling) window with the same calculation and then .combine_first() the two.
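For completeness, a minimal end-to-end sketch of this rolling cov/var approach on simulated data (a single stock column named 'stock' and a market column named 'Market'; both names are assumptions). The droplevel call is there so the covariance series and the variance series align on dates before dividing:
import numpy as np
import pandas as pd
window = 12
dates = pd.date_range('2000-01-31', periods=120, freq='M')
prices = pd.DataFrame(10.0 + np.random.rand(120, 2),
                      index=dates, columns=['Market', 'stock'])
beta_data = prices.pct_change().dropna()
ticker_covariance = beta_data.rolling(window).cov()
# keep only the stock-vs-Market covariance
ticker_covariance = ticker_covariance.loc[pd.IndexSlice[:, 'stock'], 'Market'].dropna()
benchmark_variance = beta_data['Market'].rolling(window).var().dropna()
# drop the stock level so both series share a plain date index
beta = ticker_covariance.droplevel(1) / benchmark_variance
print(beta.tail())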
I created a simple Python package, finance-calculator, based on numpy and pandas to calculate financial ratios, including beta. I am using the simple formula (as per Investopedia):
beta = covariance(returns, benchmark returns) / variance(benchmark returns)
Covariance and variance are calculated directly in pandas, which makes it fast. Using the API in the package is also simple:
import finance_calculator as fc
beta = fc.get_beta(scheme_data, benchmark_data, tail=False)
which will give you a dataframe of date and beta or the last beta value if tail is true.
but these approaches become a bottleneck when you need beta calculations across m dates for n stocks, resulting in (m x n) calculations.
Some relief could be had by running each date or stock on multiple cores, but then you end up needing a lot of hardware.
The major time cost of the available solutions is computing the variance and covariance, and NaNs should be avoided in the (index and stock) data for a correct calculation, as of pandas==0.23.0.
Thus, rerunning the calculation would be wasteful unless the results are cached.
The numpy variance and covariance versions also miscalculate beta if NaNs are not dropped.
A Cython implementation is a must for a huge data set.
