Python program uses too much memory

Python program uses too much memory - python

Function coinT() tests if two time series are stationary using ADF test and Hurst exponent. Time series are stored in 1511x6 CSV files, but for testing only a vector of the 5th column is returned by the function stock(). There are 50 files in total. It seems that the program is using too much memory as it makes the PC crash after running for ~30 secs. It works fine on 15 files, but crashes on larger sets(>50).
Can somebody please help me find out what's using so much memory? I've tried splitting computations into multiple functions and deleting the object, but it didn't help much.
import numpy as np
import pandas as pd
import statsmodels.tsa.stattools as ts
import csv
import timeit
from numpy import log, polyfit, sqrt, std, subtract
from pandas.stats.api import ols
import os
src = 'C:/Users/PC/Desktop/Magistr/Ibpython/testing/'
filenames = next(os.walk(src))[2] #load all stock file names into array
cointegratedPairs = []
def hurst(ts):
"""Returns the Hurst Exponent of the time series vector ts
H<0.5 - The time series is mean reverting
H=0.5 - The time series is a Geometric Brownian Motion
H>0.5 - The time series is trending"""
# Create the range of lag values
lags = range(2, 100)
# Calculate the array of the variances of the lagged differences
tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags]
# Use a linear fit to estimate the Hurst Exponent
poly = polyfit(log(lags), log(tau), 1)
del lags
del tau
# Return the Hurst exponent from the polyfit output
return poly[0]*2.0
#Convert file into an array
def stock(filename):
#read file into array and get it's length
delimiter = ","
with open(src + filename,'r') as dest_f:
data_iter = csv.reader(dest_f,
delimiter = delimiter,
quotechar = '"')
data = [data for data in data_iter]
data_array = np.asarray(data)[:,5]
return data_array
del data
del data_array
#Check if two time series are cointegrated
def coinTest(itemX, itemY):
indVar = map(float, stock(itemX)[0:1000]) #2009.05.22 - 2013.05.14
depVar = map(float, stock(itemY)[0:1000]) #2009.05.22 - 2013.05.14
#Calculate optimal hedge ratio "beta"
df = pd.DataFrame()
df[itemX] = indVar
df[itemY] = depVar
res = ols(y=df[itemY], x=df[itemX])
beta_hr = res.beta.x
alpha = res.beta.intercept
df["res"] = df[itemY] - beta_hr*df[itemX] - alpha
#Calculate the CADF test on the residuals
cadf = ts.adfuller(df["res"])
#Reject the null hypothesis at 1% confidence level
if cadf[4]['1%'] > cadf[0]:
#Hurst exponent test if residuals are mean reverting
if hurst(df["res"]) < 0.4:
cointegratedPairs.append((itemY,itemX))
del indVar
del depVar
del df[itemX]
del df[itemY]
del df["res"]
del cadf
#Main function
def coinT():
limit = 0
TotalPairs = 0
for itemX in filenames:
for itemY in filenames[limit:]:
TotalPairs +=1
if itemX == itemY:
next
else:
coinTest(itemX, itemY)
limit +=1

Related

Identify in a pandas series when the trend changes from positive to negative

I have a pandas dataframe with securities prices and several moving average trend lines of various moving average lengths. The data frames are sufficiently large that I would like to identify the most efficient way to capture the index of a particular series where the slope changes (In this example, let's just say from positive to negative for a given series in the dataframe.)
My hack seems very "hacky". I am currently doing the following (Note, imagine this is for a single moving average series):
filter = (df.diff()>0).diff().dropna(axis=0)
new_df = df[filter].dropna(axis=0)
Full example code below:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Create a sample Dataframe
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')
close = pd.Series([1,2,3,4,2,1,4,3])
df = pd.DataFrame({"date":days, "prices":close})
df.set_index("date", inplace=True)
print("Original DF")
print(df)
# Long Explanation
updays = (df.diff()>0) # Show True for all updays false for all downdays
print("Updays df is")
print(updays)
reversal_df = (updays.diff()) # this will only show change days as True
reversal_df.dropna(axis=0, inplace=True) # Handle the first day
trade_df = df[reversal_df].dropna() # Select only the days where the trend reversed
print("These are the days where the trend reverses it self from negative to positive or vice versa ")
print(trade_df)
# Simplified below by combining the above into two lines
filter = (df.diff()>0).diff().dropna(axis=0)
new_df = df[filter].dropna(axis=0)
print("The final result is this: ")
print(new_df)
Any help here would be appreciated. Note, I'm more interested in balancing efficiencies between how best to do this so I can understand it, and how to make it sufficiently quick to compute.

Multiple moving average solution.
Look for the comment # *** THE SOLUTION BEGINS HERE *** to see the solution, before that is just generating data, printing and plotting to validate.
What I do here is to calculate the sign of MVA slopes so a positive slope will have a value of 1 and a negative slope a value of -1.
Slope_i = MVA(i, ask; periods) - MVA(i, ask; periods)
m<periods>_slp_sgn_i = Sign(Slope_i)
Then to spot slope changes I Calculate:
m<periods>slp_chg = sign(m<periods>_slp_sgn_i - m<periods>_slp_sgn_i-1)
So for example if the slope changes from 1 (positive) to -1 (negative):
sign (-1 - 1) = sign(-2) = -1
In the other hand, if the changes from -1 to 1:
sign (1 - - 1) = sign(2) = 1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# GENERATE DATA RANDOM PRICE
_periods = 1000
_value_0 = 1.1300
_std = 0.0005
_freq = '5T'
_output_col = 'ask'
_last_date = pd.to_datetime('2021-12-15')
_p_t = np.zeros(_periods)
_wn = np.random.normal(loc=0, scale=_std, size=_periods)
_p_t[0] = _value_0
_wn[0] = 0
_date_index = pd.date_range(end=_last_date, periods=_periods, freq=_freq)
df= pd.DataFrame(np.stack([_p_t, _wn], axis=1), columns=[_output_col, "wn"], index=_date_index)
for i in range(1, _periods):
df.iloc[i][_output_col] = df.iloc[i - 1][_output_col] + df.iloc[i].wn
print(df.head(5))
# CALCULATE MOVING AVERAGES (3)
df['mva_25'] = df['ask'].rolling(25).mean()
df['mva_50'] = df['ask'].rolling(50).mean()
df['mva_100'] = df['ask'].rolling(100).mean()
# plot to check
df['ask'].plot(figsize=(15,5))
df['mva_25'].plot(figsize=(15,5))
df['mva_50'].plot(figsize=(15,5))
df['mva_100'].plot(figsize=(15,5))
plt.show()
# *** THE SOLUTION BEGINS HERE ***
# calculate mva slopes directions
# positive slope: 1, negative slope -1
df['m25_slp_sgn'] = np.sign(df['mva_25'] - df['mva_25'].shift(1))
df['m50_slp_sgn'] = np.sign(df['mva_50'] - df['mva_50'].shift(1))
df['m100_slp_sgn'] = np.sign(df['mva_100'] - df['mva_100'].shift(1))
# CALCULATE CHANGE IN SLOPE
# from 1 to -1: -1
# from -1 to 1: 1
df['m25_slp_chg'] = np.sign(df['m25_slp_sgn'] - df['m25_slp_sgn'].shift(1))
df['m50_slp_chg'] = np.sign(df['m50_slp_sgn'] - df['m50_slp_sgn'].shift(1))
df['m100_slp_chg'] = np.sign(df['m100_slp_sgn'] - df['m100_slp_sgn'].shift(1))
# clean NAN
df.dropna(inplace=True)
# print data to visually check
print(df.iloc[20:40][['mva_25', 'm25_slp_sgn', 'm25_slp_chg']])
# query where slope of MVA25 changes from positive to negative
df[(df['m25_slp_chg'] == -1)].head(5)
WARNING: Data is generated random so you'll see the plots and the printings change each time you execute the code.

How to speed up a high dimensional loop in python with numpy instead of pandas?

This Loop does its work in 5 hours. How can i speed it up? I read something about using numpy functions instead of pandas. I tried as you can see but i am to new to python to do it right. The big thing here is the high dimensional data with 6000 columns. Every data is static, except of the random weights. How do i write better code?
import numpy as np
import os
#Covarinace Matrix in Pandas Dataframe 6000 columns x 6000 rows
cov = input_table_1.copy()
#Mean returns Pandas DataFrame 6000 columns x 1800 rows
mean_returns = input_table_2.copy().squeeze()
#Looping number
num_portfolios = 100.000
#Empty Resultsmatrix
results_matrix = np.zeros((len(cov.columns)+1, num_portfolios))
rf=0
#Loop corpus
for i in range(num_portfolios):
#Random numbers between 0 and 1 for every column
weights = np.random.uniform(0,1,len(cov.columns))
#Ensure sum of all random numbers is = 1
weights /= np.sum(weights)
#Some easy math operations
portfolio_return = np.sum(mean_returns * weights) * 252
portfolio_std = np.sqrt(np.dot(weights.T, np.dot(cov, weights))) * np.sqrt(252)
sharpe_ratio = (portfolio_return - rf) / portfolio_std
#write sharpe_ratio in result matrix as result for every loop
results_matrix[0,i] = sharpe_ratio
#iterate through the weight vector and add data to results array
for j in range(len(weights)):
results_matrix[j+1,i] = weights[j]
#output table as pandas data frame
output_table = pd.DataFrame(results_matrix.T,columns=['sharpe'] + [ticker for ticker in list(cov.columns)] )```

there is not a generic way to do that, first of all you must identify where your code is slow, and after that you can apply optimization.
First of all you have nested loop so complexity is O(n^2) not a bid deal here, because lot of work can be done using vectorial approach.
In python creation of new object is slow, so for example, if it can be stored in ram, the first np.random.uniform can be done one time and consumed during the cycle.
nested iterator, can be done in vectorial mode, this seem the best candidates for performance.
Anyway i suggest to use a tool like perf_tool that will guide you exactly on the slow piece of code [*]
[*] i'm the main developer of this tool.

#AmilaMGunawardana Here is my first try with tensorflow, but i is not fast enough. At the end i waited 5 hours for 100.000 rounds. Maybe i have to do something better?
Perftool showed me that evrything in the code is fast, except the Part:
vol_arr[x] = tnp.sqrt(tnp.dot(multi_randoms[x].T, np.dot(covData*252, multi_randoms[x]))) --> This part takes 90% of the execution Time.
covData = input_table_1.copy()
#Mean returns Pandas DataFrame 6000 columns x 1800 rows
returns = input_table_2.copy().squeeze()
#Looping number
num_portfolios = 100000
rf=0
#print("mean_returns: ", mean_returns)
#print("cov2: ", cov2)
#print("cov: ", cov)
all_weights = np.zeros((num_ports, len(returns.columns))) #tnp.zeros([num_ports,len(returns.columns)], dtype=tnp.float32) #np.zeros((num_ports, len(returns.columns)))
ret_arr = pd.to_numeric(np.zeros(num_ports))#tnp.zeros(num_ports, dtype=tnp.float32)# pd.to_numeric(np.zeros(num_ports))
vol_arr = pd.to_numeric(np.zeros(num_ports))#tnp.zeros(num_ports, dtype=tnp.float32)
sharpe_arr = pd.to_numeric(np.zeros(num_ports))#tnp.zeros(num_ports, dtype=tnp.float32)
multi_randoms = np.random.normal(0, 1., size=(num_portfolios,len(covData.columns) ))
#perf_tool('main')
def main():
for x in range(num_ports):
with PerfTool('preparation1'):
# Save weights
all_weights[x,:] = multi_randoms[x]
with PerfTool('preparation2'):
# Expected return
ret_arr[x] = tnp.sum( (mean_returns * multi_randoms[x] * 252))
with PerfTool('preparation3'):
# Expected volatility
vol_arr[x] = tnp.sqrt(tnp.dot(multi_randoms[x].T, np.dot(covData*252, multi_randoms[x])))
with PerfTool('preparation4'):
# Sharpe Ratio
sharpe_arr[x] = ret_arr[x] - rf /vol_arr[x]
PerfTool.set_enabled()
main()
PerfTool.show_stats_if_enabled()```

This showes up one way of getting better with parallel loading. How could i get rid of the loop? Is there a way to do this calculations in just one step with using all_weights Dataframe once instead of looping over it?
import pandas as pd
import numpy as np
from perf_tool import PerfTool, perf_tool
from joblib import Parallel, delayed, parallel_backend
#Covarinace Matrix in Pandas Dataframe 6000 columns x 6000 rows
covData = input_table_1.copy()
#Mean returns Pandas DataFrame 6000 columns x 1800 rows
mean_returns = input_table_2.copy().squeeze()
#Looping number
num_ports = 100000
all_weights = np.zeros((num_ports, len(mean_returns.columns)))
#multi_randoms = np.random.random(size=(len(df.columns) ))
for x in range(num_ports):
weights = np.array(np.random.random(len(mean_returns.columns)))
weights = weights/np.sum(weights)
all_weights[x,:] = weights
#print(weights)
#weights = np.array(np.random.random(len(returns.columns)))
#print(all_weights)
#print("cov2 type: ", type(cov2))
#cov = pd.DataFrame(np.random.normal(0, 1., size=(600,600)))
#print("cov type: ", type(cov))
rf=0
#print("mean_returns: ", mean_returns)
#print("cov2: ", cov2)
#print("cov: ", cov)
#all_weights = np.zeros((num_ports, len(returns.columns)))
ret_arr = pd.to_numeric(np.zeros(num_ports))
vol_arr = pd.to_numeric(np.zeros(num_ports))
sharpe_arr = pd.to_numeric(np.zeros(num_ports))
##perf_tool('main')
##jit(parallel=True)
def test(x):
#for x in range(num_ports):
#with PerfTool('preparation1'):
# Weights
#weights = np.array(np.random.random(len(returns.columns)))
#with PerfTool('preparation2'):
#weights = weights/np.sum(weights)
#with PerfTool('preparation3'):
# Save weights
weights= all_weights[x]
#with PerfTool('preparation4'):
# Expected return
ret_arr[x] = np.sum( (mean_returns * weights * 252))
#with PerfTool('preparation5'):
# Expected volatility
vol_arr[x] = np.sqrt(np.dot(weights.T, np.dot(covData*252, weights)))
#with PerfTool('preparation6'):
# Sharpe Ratio
return x, ret_arr[x] - rf /vol_arr[x]
#sharpe_arr[x] = (np.sum( (mean_returns * all_weights * 252)) - rf) /(np.sqrt(np.dot(all_weights.T, np.dot(covData*252, all_weights))))
#PerfTool.set_enabled()
sharpe= []
weighttable= []
weighttable, sharpe= zip(*Parallel(n_jobs=-1)([delayed(test)(i) for i in range(num_ports)]))```

Building a weighted histogram using two binary files

I have two binary files that I need to iterate through simultaneously so that the value yielded in one file corresponds correctly (same location) to the value yielded in the other. I'm sorting values into histogram bins and the value from one file corresponds to the weight of the value from the other file.
I tried the following syntax:
import numpy as np
import struct
import matplotlib.pyplot as plt
low = np.inf
high = -np.inf
struct_fmt = 'f'
struct_len = struct.calcsize(struct_fmt)
struct_unpack = struct.Struct(struct_fmt).unpack_from
file = "/projects/current/real-core-snaps/core4_256_velx_0009.bin"
file2 = "/projects/current/real-core-snaps/core4_256_dens_0009.bin"
def read_chunks(f, length):
while True:
data = f.read(length)
if not data: break
yield data
loop = 0
with open(file,"rb") as f:
for chunk in read_chunks(f, struct_len):
x = struct_unpack(chunk)
low = np.minimum(x, low)
high = np.maximum(x, high)
loop += 1
nbins = math.ceil(math.sqrt(loop))
bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.int64)
f = open(file,"rb")
f2 = open(file2,"rb")
for chunk1,chunk2 in zip(read_chunks(f, struct_len),read_chunks(f2, struct_len)):
subtotal,e = np.histogram(struct_unpack(chunk1),bins=bin_edges,weights=struct_unpack(chunk2))
total = np.add(total,subtotal,out=total,casting="unsafe")
plt.hist(bin_edges[:-1], bins=bin_edges, weights=total)
plt.savefig('hist-veldens.svg')
but the histogram produced is ridiculous (see below). What am I doing wrong?
The data files are located at https://drive.google.com/file/d/1fhia2CGzl_aRX9Q9Ng61W-4XJGQe1OCV/view?usp=sharing and https://drive.google.com/file/d/1CrhQjyG2axSFgK9LGytELbxjy3Ndon1S/view?usp=sharing.

The mistake is that total = np.zeros(nbins, np.int64) is assigning an integer type to each of the elements of the array total. Given that subtotal does not contain the count number in a weighted histogram but a float-type, total should also be of type float.

Pandas iteration over rows for features calculation

I have a pandas data frame and I want to calculate some features based on some short_window, long_window and bins values. More specifically, for each different row, I want to calculate some features. In order to do so, I move one row forward the df_long = df.loc[row:long_window+row] such as in the first iteration the pandas data frame for row=0 would be df_long = df.loc[0:50+0] and some features would be calculated based on this data frame, for row=1 would be df_long = df.loc[1:50+1] and some other features would be calculated and continues.
from numpy.random import seed
from numpy.random import randint
import pandas as pd
from joblib import Parallel, delayed
bins = 12
short_window = 10
long_window = 50
# seed random number generator
seed(1)
price = pd.DataFrame({
'DATE_TIME': pd.date_range('2012-01-01', '2012-02-01', freq='30min'),
'value': randint(2, 20, 1489),
'amount': randint(50, 200, 1489)
})
def vap(row, df, short_window, long_window, bins):
df_long = df.loc[row:long_window+row]
df_short = df_long.tail(short_window)
binning = pd.cut(df_long['value'], bins, retbins=True)[1]
group_months = pd.DataFrame(df_short['amount'].groupby(pd.cut(df_short['value'], binning)).sum())
return group_months['amount'].tolist(), df.loc[long_window + row + 1, 'DATE_TIME']
def feature_extraction(data, short_window, long_window, bins):
# Vap feature extraction
ls = [f"feature{row + 1}" for row in range(bins)]
amount, date = zip(*Parallel(n_jobs=4)(delayed(vap)(i, data, short_window, long_window, bins)
for i in range(0, data.shape[0] - long_window - 1)))
temp = pd.DataFrame(date, columns=['DATE_TIME'])
temp[ls] = pd.DataFrame(amount, index=temp.index)
data = data.merge(temp, on='DATE_TIME', how='outer')
return data
df = feature_extraction(price, short_window, long_window, bins)
I tried to run it in parallel in order to save time but due to the dimensions of my data, it takes a long of time to finish.
Is there any way to change this iterative process (df_long = df.loc[row:long_window+row]) in order to reduce the computational cost? I was wondering if there is any way to use pandas.rolling but I am not sure how to use it in this case.
Any help would be much appreciated!
Thank you

This is the first try to speed up the calculation. I checked the first 100 rows and found out that the binning variable was always the same. So I managed to do an efficient algorithm with fixed bins. But when I checked the function on the whole data, I found out that there are about 100 lines out of 1489, that had a different binning variable so the solution below deviates in 100 lines from the original answer.
Benchmarking:
My fast function: 28 ms
My precise function: 388 ms
Original function: 12200 ms
So a speed up of around 500 times for the fast function and 20 times for precise function
Fast function code:
def feature_extraction2(data, short_window, long_window, bins):
ls = [f"feature{row + 1}" for row in range(bins)]
binning = pd.cut([2,19], bins, retbins=True)[1]
bin_group = np.digitize(data['value'], binning, right=True)
l_sum = []
for i in range(1, bins+1):
sum1 = ((bin_group == i)*data['amount']).rolling(short_window).sum()
l_sum.append(sum1)
ar_sum = np.array(l_sum).T
ar_shifted = np.empty_like(ar_sum)
ar_shifted[:long_window+1,:] = np.nan
ar_shifted[long_window+1:,:] = ar_sum[long_window:-1,:]
temp = pd.DataFrame(ar_shifted, columns = ls)
data = pd.concat([data,temp], axis = 1, sort = False)
return data
Precise function:
data = price.copy()
# Vap feature extraction
ls = [f"feature{row + 1}" for row in range(bins)]
data.shape[0] - long_window - 1)))
norm_volume = []
date = []
for i in range(0, data.shape[0] - long_window - 1):
row = i
df = data
df_long = df.loc[row:long_window+row]
df_short = df_long.tail(short_window)
binning = pd.cut(df_long['value'], bins, retbins=True)[1]
group_months = df_short['amount'].groupby(pd.cut(df_short['value'], binning)).sum().values
x,y = group_months, df.loc[long_window + row + 1, 'DATE_TIME']
norm_volume.append(x)
date.append(y)
temp = pd.DataFrame(date, columns=['DATE_TIME'])
temp[ls] = pd.DataFrame(norm_volume, index=temp.index)
data = data.merge(temp, on='DATE_TIME', how='outer')

Efficient Python Pandas Stock Beta Calculation on Many Dataframes

I have many (4000+) CSVs of stock data (Date, Open, High, Low, Close) which I import into individual Pandas dataframes to perform analysis. I am new to python and want to calculate a rolling 12month beta for each stock, I found a post to calculate rolling beta (Python pandas calculate rolling stock beta using rolling apply to groupby object in vectorized fashion) however when used in my code below takes over 2.5 hours! Considering I can run the exact same calculations in SQL tables in under 3 minutes this is too slow.
How can I improve the performance of my below code to match that of SQL? I understand Pandas/python has that capability. My current method loops over each row which I know slows performance but I am unaware of any aggregate way to perform a rolling window beta calculation on a dataframe.
Note: the first 2 steps of loading the CSVs into individual dataframes and calculating daily returns only takes ~20seconds. All my CSV dataframes are stored in the dictionary called 'FilesLoaded' with names such as 'XAO'.
Your help would be much appreciated!
Thank you :)
import pandas as pd, numpy as np
import datetime
import ntpath
pd.set_option('precision',10) #Set the Decimal Point precision to DISPLAY
start_time=datetime.datetime.now()
MarketIndex = 'XAO'
period = 250
MinBetaPeriod = period
# ***********************************************************************************************
# CALC RETURNS
# ***********************************************************************************************
for File in FilesLoaded:
FilesLoaded[File]['Return'] = FilesLoaded[File]['Close'].pct_change()
# ***********************************************************************************************
# CALC BETA
# ***********************************************************************************************
def calc_beta(df):
np_array = df.values
m = np_array[:,0] # market returns are column zero from numpy array
s = np_array[:,1] # stock returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
#Build Custom "Rolling_Apply" function
def rolling_apply(df, period, func, min_periods=None):
if min_periods is None:
min_periods = period
result = pd.Series(np.nan, index=df.index)
for i in range(1, len(df)+1):
sub_df = df.iloc[max(i-period, 0):i,:]
if len(sub_df) >= min_periods:
idx = sub_df.index[-1]
result[idx] = func(sub_df)
return result
#Create empty BETA dataframe with same index as RETURNS dataframe
df_join = pd.DataFrame(index=FilesLoaded[MarketIndex].index)
df_join['market'] = FilesLoaded[MarketIndex]['Return']
df_join['stock'] = np.nan
for File in FilesLoaded:
df_join['stock'].update(FilesLoaded[File]['Return'])
df_join = df_join.replace(np.inf, np.nan) #get rid of infinite values "inf" (SQL won't take "Inf")
df_join = df_join.replace(-np.inf, np.nan)#get rid of infinite values "inf" (SQL won't take "Inf")
df_join = df_join.fillna(0) #get rid of the NaNs in the return data
FilesLoaded[File]['Beta'] = rolling_apply(df_join[['market','stock']], period, calc_beta, min_periods = MinBetaPeriod)
# ***********************************************************************************************
# CLEAN-UP
# ***********************************************************************************************
print('Run-time: {0}'.format(datetime.datetime.now() - start_time))

Generate Random Stock Data
20 Years of Monthly Data for 4,000 Stocks
dates = pd.date_range('1995-12-31', periods=480, freq='M', name='Date')
stoks = pd.Index(['s{:04d}'.format(i) for i in range(4000)])
df = pd.DataFrame(np.random.rand(480, 4000), dates, stoks)
df.iloc[:5, :5]
Roll Function
Returns groupby object ready to apply custom functions
See Source
def roll(df, w):
# stack df.values w-times shifted once at each stack
roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
# roll_array is now a 3-D array and can be read into
# a pandas panel object
panel = pd.Panel(roll_array,
items=df.index[w-1:],
major_axis=df.columns,
minor_axis=pd.Index(range(w), name='roll'))
# convert to dataframe and pivot + groupby
# is now ready for any action normally performed
# on a groupby object
return panel.to_frame().unstack().T.groupby(level=0)
Beta Function
Use closed form solution of OLS regression
Assume column 0 is market
See Source
def beta(df):
# first column is the market
X = df.values[:, [0]]
# prepend a column of ones for the intercept
X = np.concatenate([np.ones_like(X), X], axis=1)
# matrix algebra
b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values[:, 1:])
return pd.Series(b[1], df.columns[1:], name='Beta')
Demonstration
rdf = roll(df, 12)
betas = rdf.apply(beta)
Timing
Validation
Compare calculations with OP
def calc_beta(df):
np_array = df.values
m = np_array[:,0] # market returns are column zero from numpy array
s = np_array[:,1] # stock returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
print(calc_beta(df.iloc[:12, :2]))
-0.311757542437
print(beta(df.iloc[:12, :2]))
s0001 -0.311758
Name: Beta, dtype: float64
Note the first cell
Is the same value as validated calculations above
betas = rdf.apply(beta)
betas.iloc[:5, :5]
Response to comment
Full working example with simulated multiple dataframes
num_sec_dfs = 4000
cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(480, 4), dates, cols) for i in range(num_sec_dfs)}
market = pd.Series(np.random.rand(480), dates, name='Market')
df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(1)
betas = roll(df.pct_change().dropna(), 12).apply(beta)
for c, col in betas.iteritems():
dfs[c]['Beta'] = col
dfs['s0001'].head(20)

Using a generator to improve memory efficiency
Simulated data
m, n = 480, 10000
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
stocks = pd.Index(['s{:04d}'.format(i) for i in range(n)])
df = pd.DataFrame(np.random.rand(m, n), dates, stocks)
market = pd.Series(np.random.rand(m), dates, name='Market')
df = pd.concat([df, market], axis=1)
Beta Calculation
def beta(df, market=None):
# If the market values are not passed,
# I'll assume they are located in a column
# named 'Market'. If not, this will fail.
if market is None:
market = df['Market']
df = df.drop('Market', axis=1)
X = market.values.reshape(-1, 1)
X = np.concatenate([np.ones_like(X), X], axis=1)
b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values)
return pd.Series(b[1], df.columns, name=df.index[-1])
roll function
This returns a generator and will be far more memory efficient
def roll(df, w):
for i in range(df.shape[0] - w + 1):
yield pd.DataFrame(df.values[i:i+w, :], df.index[i:i+w], df.columns)
Putting it all together
betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T
Validation
OP beta calc
def calc_beta(df):
np_array = df.values
m = np_array[:,0] # market returns are column zero from numpy array
s = np_array[:,1] # stock returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
Experiment setup
m, n = 12, 2
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(m, 4), dates, cols) for i in range(n)}
market = pd.Series(np.random.rand(m), dates, name='Market')
df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(1)
betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T
for c, col in betas.iteritems():
dfs[c]['Beta'] = col
dfs['s0000'].head(20)
calc_beta(df[['Market', 's0000']])
0.0020118230147777435
NOTE:
The calculations are the same

While efficient subdivision of the input data set into rolling windows is important to the optimization of the overall calculations, the performance of the beta calculation itself can also be significantly improved.
The following optimizes only the subdivision of the data set into rolling windows:
def numpy_betas(x_name, window, returns_data, intercept=True):
if intercept:
ones = numpy.ones(window)
def lstsq_beta(window_data):
x_data = numpy.vstack([window_data[x_name], ones]).T if intercept else window_data[[x_name]]
beta_arr, residuals, rank, s = numpy.linalg.lstsq(x_data, window_data)
return beta_arr[0]
indices = [int(x) for x in numpy.arange(0, returns_data.shape[0] - window + 1, 1)]
return DataFrame(
data=[lstsq_beta(returns_data.iloc[i:(i + window)]) for i in indices]
, columns=list(returns_data.columns)
, index=returns_data.index[window - 1::1]
)
The following also optimizes the beta calculation itself:
def custom_betas(x_name, window, returns_data):
window_inv = 1.0 / window
x_sum = returns_data[x_name].rolling(window, min_periods=window).sum()
y_sum = returns_data.rolling(window, min_periods=window).sum()
xy_sum = returns_data.mul(returns_data[x_name], axis=0).rolling(window, min_periods=window).sum()
xx_sum = numpy.square(returns_data[x_name]).rolling(window, min_periods=window).sum()
xy_cov = xy_sum - window_inv * y_sum.mul(x_sum, axis=0)
x_var = xx_sum - window_inv * numpy.square(x_sum)
betas = xy_cov.divide(x_var, axis=0)[window - 1:]
betas.columns.name = None
return betas
Comparing the performance of the two different calculations, you can see that as the window used in the beta calculation increases, the second method dramatically outperforms the first:
Comparing the performance to that of #piRSquared's implementation, the custom method takes roughly 350 millis to evaluate compared to over 2 seconds.

Further optimizing on #piRSquared's implementation for both speed and memory. the code is also simplified for clarity.
from numpy import nan, ndarray, ones_like, vstack, random
from numpy.lib.stride_tricks import as_strided
from numpy.linalg import pinv
from pandas import DataFrame, date_range
def calc_beta(s: ndarray, m: ndarray):
x = vstack((ones_like(m), m))
b = pinv(x.dot(x.T)).dot(x).dot(s)
return b[1]
def rolling_calc_beta(s_df: DataFrame, m_df: DataFrame, period: int):
result = ndarray(shape=s_df.shape, dtype=float)
l, w = s_df.shape
ls, ws = s_df.values.strides
result[0:period - 1, :] = nan
s_arr = as_strided(s_df.values, shape=(l - period + 1, period, w), strides=(ls, ls, ws))
m_arr = as_strided(m_df.values, shape=(l - period + 1, period), strides=(ls, ls))
for row in range(period, l):
result[row, :] = calc_beta(s_arr[row - period, :], m_arr[row - period])
return DataFrame(data=result, index=s_df.index, columns=s_df.columns)
if __name__ == '__main__':
num_sec_dfs, num_periods = 4000, 480
dates = date_range('1995-12-31', periods=num_periods, freq='M', name='Date')
stocks = DataFrame(data=random.rand(num_periods, num_sec_dfs), index=dates,
columns=['s{:04d}'.format(i) for i in
range(num_sec_dfs)]).pct_change()
market = DataFrame(data=random.rand(num_periods), index=dates, columns=
['Market']).pct_change()
betas = rolling_calc_beta(stocks, market, 12)
%timeit betas = rolling_calc_beta(stocks, market, 12)
335 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

HERE'S THE SIMPLEST AND FASTEST SOLUTION
The accepted answer was too slow for what I needed and the I didn't understand the math behind the solutions asserted as faster. They also gave different answers, though in fairness I probably just messed it up.
I don't think you need to make a custom rolling function to calculate beta with pandas 1.1.4 (or even since at least .19). The below code assumes the data is in the same format as the above problems--a pandas dataframe with a date index, percent returns of some periodicity for the stocks, and market values are located in a column named 'Market'.
If you don't have this format, I recommend joining the stock returns to the market returns to ensure the same index with:
# Use .pct_change() only if joining Close data
beta_data = stock_data.join(market_data), how = 'inner').pct_change().dropna()
After that, it's just covariance divided by variance.
ticker_covariance = beta_data.rolling(window).cov()
# Limit results to the stock (i.e. column name for the stock) vs. 'Market' covariance
ticker_covariance = ticker_covariance.loc[pd.IndexSlice[:, stock], 'Market'].dropna()
benchmark_variance = beta_data['Market'].rolling(window).var().dropna()
beta = ticker_covariance / benchmark_variance
NOTES: If you have a multi-index, you'll have to drop the non-date levels to use the rolling().apply() solution. I only tested this for one stock and one market. If you have multiple stocks, a modification to the ticker_covariance equation after .loc is probably needed. Last, if you want to calculate beta values for the periods before the full window (ex. stock_data begins 1 year ago, but you use 3yrs of data), then you can modify the above to and expanding (instead of rolling) window with the same calculation and then .combine_first() the two.

Created a simple python package finance-calculator based on numpy and pandas to calculate financial ratios including beta. I am using the simple formula (as per investopedia):
beta = covariance(returns, benchmark returns) / variance(benchmark returns)
Covariance and variance are directly calculated in pandas which makes it fast. Using the api in the package is also simple:
import finance_calculator as fc
beta = fc.get_beta(scheme_data, benchmark_data, tail=False)
which will give you a dataframe of date and beta or the last beta value if tail is true.

but these would be blockish when you require beta calculations across the dates(m) for multiple stocks(n) resulting (m x n) number of calculations.
Some relief could be taken by running each date or stock on multiple cores, but then you will end up having huge hardware.
The major time requirement for the solutions available is finding the variance and co-variance and also NaN should be avoided in (Index and stock) data for a correct calculation as per pandas==0.23.0.
Thus running again would result stupid move unless the calculations are cached.
numpy variance and co-variance version also happens to miss-calculate the beta if NaN are not dropped.
A Cython implementation is must for huge set of data.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python program uses too much memory - python

Related

Identify in a pandas series when the trend changes from positive to negative

How to speed up a high dimensional loop in python with numpy instead of pandas?

Building a weighted histogram using two binary files

Pandas iteration over rows for features calculation

Efficient Python Pandas Stock Beta Calculation on Many Dataframes

Categories

Resources