Pandas iteration over rows for features calculation

I have a pandas DataFrame and I want to calculate some features based on short_window, long_window and bins values. For each row, I take the slice df_long = df.loc[row:long_window+row] and compute the features from it, moving forward one row at a time: in the first iteration (row=0) the slice is df_long = df.loc[0:50+0], for row=1 it is df_long = df.loc[1:50+1], and so on.
from numpy.random import seed
from numpy.random import randint
import pandas as pd
from joblib import Parallel, delayed
bins = 12
short_window = 10
long_window = 50
# seed random number generator
seed(1)
price = pd.DataFrame({
    'DATE_TIME': pd.date_range('2012-01-01', '2012-02-01', freq='30min'),
    'value': randint(2, 20, 1489),
    'amount': randint(50, 200, 1489)
})
def vap(row, df, short_window, long_window, bins):
    df_long = df.loc[row:long_window+row]
    df_short = df_long.tail(short_window)
    binning = pd.cut(df_long['value'], bins, retbins=True)[1]
    group_months = pd.DataFrame(df_short['amount'].groupby(pd.cut(df_short['value'], binning)).sum())
    return group_months['amount'].tolist(), df.loc[long_window + row + 1, 'DATE_TIME']
def feature_extraction(data, short_window, long_window, bins):
    # Vap feature extraction
    ls = [f"feature{row + 1}" for row in range(bins)]
    amount, date = zip(*Parallel(n_jobs=4)(delayed(vap)(i, data, short_window, long_window, bins)
                                           for i in range(0, data.shape[0] - long_window - 1)))
    temp = pd.DataFrame(date, columns=['DATE_TIME'])
    temp[ls] = pd.DataFrame(amount, index=temp.index)
    data = data.merge(temp, on='DATE_TIME', how='outer')
    return data
df = feature_extraction(price, short_window, long_window, bins)
I tried to run it in parallel to save time, but because of the size of my data it still takes a long time to finish.
Is there any way to change this iterative process (df_long = df.loc[row:long_window+row]) in order to reduce the computational cost? I was wondering if there is any way to use pandas.rolling but I am not sure how to use it in this case.
Any help would be much appreciated!
Thank you

This is a first attempt at speeding up the calculation. I checked the first 100 rows and found that the binning variable was always the same, so I wrote an efficient algorithm with fixed bins. When I checked the function on the whole data, however, I found that about 100 of the 1489 rows have a different binning variable, so the fast solution below deviates from the original result in about 100 rows.
Benchmarking:
My fast function: 28 ms
My precise function: 388 ms
Original function: 12200 ms
So the fast function is roughly 435 times faster than the original, and the precise function roughly 30 times faster.
Fast function code:
import numpy as np
def feature_extraction2(data, short_window, long_window, bins):
    ls = [f"feature{row + 1}" for row in range(bins)]
    # Fixed bin edges over the full value range (2..19) instead of per-window edges
    binning = pd.cut([2,19], bins, retbins=True)[1]
    bin_group = np.digitize(data['value'], binning, right=True)
    l_sum = []
    for i in range(1, bins+1):
        # Rolling sum of the amounts whose value falls into bin i
        sum1 = ((bin_group == i)*data['amount']).rolling(short_window).sum()
        l_sum.append(sum1)
    ar_sum = np.array(l_sum).T
    # Shift by one row so each result lines up with the DATE_TIME that follows its window
    ar_shifted = np.empty_like(ar_sum)
    ar_shifted[:long_window+1,:] = np.nan
    ar_shifted[long_window+1:,:] = ar_sum[long_window:-1,:]
    temp = pd.DataFrame(ar_shifted, columns = ls)
    data = pd.concat([data,temp], axis = 1, sort = False)
    return data
Precise function:
data = price.copy()
# Vap feature extraction
ls = [f"feature{row + 1}" for row in range(bins)]
norm_volume = []
date = []
for i in range(0, data.shape[0] - long_window - 1):
    row = i
    df = data
    df_long = df.loc[row:long_window+row]
    df_short = df_long.tail(short_window)
    binning = pd.cut(df_long['value'], bins, retbins=True)[1]
    group_months = df_short['amount'].groupby(pd.cut(df_short['value'], binning)).sum().values
    x,y = group_months, df.loc[long_window + row + 1, 'DATE_TIME']
    norm_volume.append(x)
    date.append(y)
temp = pd.DataFrame(date, columns=['DATE_TIME'])
temp[ls] = pd.DataFrame(norm_volume, index=temp.index)
data = data.merge(temp, on='DATE_TIME', how='outer')
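The answer reports that the fixed bins only disagree with the per-window bins in a minority of windows. One way to see which windows those are, sketched here under the assumption that the values are integers in the range 2..19 as in the question, is to compare each window's rolling minimum and maximum with that range. The data below is illustrative (generated with a different random generator than the question's), and the names win, complete and deviates are made up for this example.

```python
import numpy as np
import pandas as pd

bins = 12
long_window = 50

# Illustrative data in the same 2..19 range as the question (not its exact series)
rng = np.random.default_rng(1)
value = pd.Series(rng.integers(2, 20, 1489))

# df.loc[row:row + long_window] includes both end labels, so a window has long_window + 1 rows
win = long_window + 1
roll_min = value.rolling(win).min()
roll_max = value.rolling(win).max()
complete = roll_min.notna()

# With an integer bin count, pd.cut derives its edges from the window's min and max,
# so a window that misses 2 or 19 cannot share the fixed edges of pd.cut([2, 19], bins)
deviates = complete & ((roll_min != 2) | (roll_max != 19))
print(int(deviates.sum()), "of", int(complete.sum()), "windows use different bin edges")
```

If this holds for the real data, the flagged windows could be recomputed with the precise method while the remaining rows keep the fast result.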

Related

Generating and Storing Samples of an Exponential Distribution with a name for each sample using a loop

I've got a weird question for a class project. Assuming X ~ Exp(Lambda), Lambda=1.6, I have to generate 100 samples of X, with the index of each sample corresponding to its sample size (S1, S2 ... S100). I've worked out a simple loop which generates the required samples in an array, but I am not able to rename the array.
First attempt:
import numpy as np
import matplotlib.pyplot as plt
samples = []
for i in range(1,101,1):
    samples.append(np.random.exponential(scale= 1/1.6, size= i))
Second attempt:
import numpy as np
import matplotlib.pyplot as plt
for i in range(1,101,1):
    samples = np.random.exponential(scale= 1/1.2, size= i)
    col = f'samples {i}'
    df_samples[col] = exponential_sample
df_samples = pd.DataFrame(samples)
An example how I would like to visualize the data:
# drawing 50 random samples of size 2 from the exponentially distributed population
sample_size = 2
df2 = pd.DataFrame(index= ['x1', 'x2'] )
for i in range(1, 51):
    exponential_sample = np.random.exponential((1/rate), sample_size)
    col = f'sample {i}'
    df2[col] = exponential_sample
# Taking a peek at the samples
df2
But instead of having a fixed sample size of 2, I would like to have sample size = i. This way, I will be able to generate 1 row for the first column (S1), 2 rows for the second column (S2), and so on until I reach 100 rows for the 100th column (S100).

You cannot easily put vectors of different lengths into a DataFrame, so your mock-up code would not work, but you can concatenate one vector at a time:
df = pd.DataFrame()
for i in range(100,10100,100):
    tmp = pd.DataFrame({f'S{i}':np.random.exponential(scale= 1/1.2, size= i)})
    df = pd.concat([df, tmp], axis=1)

Use a dict instead maybe?
samples = {}
for i in range(100,10100,100):
    samples[i] = np.random.exponential(scale= 1/1.2, size= i)
Then you can convert it into a pandas DataFrame if you like.
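As a minimal sketch of that conversion, assuming the sizes 1..100 and Lambda=1.6 from the question rather than the values used in the answers: wrapping each array in a pd.Series lets pandas align the columns on the row index and pad the shorter ones with NaN. The names rate and df_samples are chosen for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rate = 1.6  # Lambda from the question; NumPy's scale parameter is 1 / Lambda

# One sample per size i = 1..100, keyed S1..S100
samples = {f"S{i}": rng.exponential(scale=1 / rate, size=i) for i in range(1, 101)}

# Each Series carries its own 0..i-1 index, so shorter columns are padded with NaN
df_samples = pd.DataFrame({name: pd.Series(values) for name, values in samples.items()})
print(df_samples.shape)  # (100, 100)
```

Column S1 then holds one value followed by NaN, S2 holds two, and S100 is completely filled.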

Need to use apply or broadcasting and masking to iterate over a DataFrame

I have a data frame that I need to iterate over. I want to use either apply or broadcasting and masking. This is the pseudocode I am trying to improve upon.
Algorithm 1: The algorithm
initialize the population (of size n) uniformly randomly, obeying the bounds;
while a pre-determined number of iterations is not complete do
    set the random parameters (two independent parameters for each of the d variables);
    find the best and the worst vectors in the population;
    for each vector in the population do
        create a new vector using the current vector, the best vector, the worst vector, and the random parameters;
        if the new vector is at least as good as the current vector then
            current vector = new vector;
This is the code I have so far.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.uniform(-5.0, 10.0, size = (20, 5)), columns = list('ABCDE'))
pd.set_option('display.max_columns', 500)
df
#while portion of pseudocode
f_func = np.square(df).sum(axis=1)
final_func = np.square(f_func)
xti_best = final_func.idxmin()
xti_worst = final_func.idxmax()
print(final_func)
print(df.head())
print(df.tail())
#for loop of pseudocode
#for row in df.iterrows():
    #implement equation from assignment
    #define in array math
    #xi_new = row.to_numpy() + np.random.uniform(0, 1, size = (1, 5)) * (df.iloc[xti_best].values - np.absolute(row.to_numpy())) - np.random.uniform(0, 1, size = (1, 5)) * (df.iloc[xti_worst].values - np.absolute(row.to_numpy()))
    #print(xi_new)
df2 = df.apply(lambda row: 0 if row == 0 else row + np.random.uniform(0, 1, size = (1, 5)) * (df.iloc[xti_best].values - np.absolute(axis = 1)))
print(df2)
The formula I am trying to use for xi_new is:
#xi_new = xi_current + random value between 0,1(xti_best -abs(xi_current)) - random value(xti_worst - abs(xi_current))

I'm not sure I'm implementing your formula correctly, but hopefully this helps
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.uniform(-5.0, 10.0, size = (20, 5)), columns = list('ABCDE'))
#while portion of pseudocode
f_func = np.square(df).sum(axis=1)
final_func = np.square(f_func)
xti_best_idx = final_func.idxmin()
xti_worst_idx = final_func.idxmax()
xti_best = df.loc[xti_best_idx]
xti_worst = df.loc[xti_worst_idx]
#Calculate random values for the whole df for the two different areas where you need randomness
nrows,ncols = df.shape
r1 = np.random.uniform(0, 1, size = (nrows, ncols))
r2 = np.random.uniform(0, 1, size = (nrows, ncols))
#xi_new = xi_current + random value between 0,1(xti_best -abs(xi_current)) - random value(xti_worst - abs(xi_current))
df= df+r1*xti_best.sub(df.abs())-r2*xti_worst.sub(df.abs())
df
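The pseudocode in the question also has an acceptance step ("if the new vector is at least as good as the current vector") and an outer loop, which the update above leaves out. A rough sketch of both with broadcasting and a boolean mask, in plain NumPy, could look like this. It assumes that lower objective values are better, uses the sum of squares as a stand-in objective, draws the random numbers per element as the answer does, and does not enforce the bounds; the function names and the iteration count are invented for the example.

```python
import numpy as np

def objective(pop):
    # Sum of squares per row, a stand-in for the question's f_func (lower is better)
    return np.square(pop).sum(axis=1)

def run(pop, n_iter, rng):
    for _ in range(n_iter):
        fitness = objective(pop)
        best = pop[fitness.argmin()]
        worst = pop[fitness.argmax()]
        # Per-element random draws, as in the answer above
        r1 = rng.uniform(0, 1, size=pop.shape)
        r2 = rng.uniform(0, 1, size=pop.shape)
        candidate = pop + r1 * (best - np.abs(pop)) - r2 * (worst - np.abs(pop))
        # Boolean mask: keep a candidate row only if it is at least as good
        accept = objective(candidate) <= fitness
        pop = np.where(accept[:, None], candidate, pop)
    return pop

rng = np.random.default_rng(0)
start = rng.uniform(-5.0, 10.0, size=(20, 5))
final = run(start, n_iter=50, rng=rng)  # 50 iterations is an arbitrary choice
print(objective(start).min(), objective(final).min())
```

Because a row is only replaced when its candidate is at least as good, no row's objective value can get worse from one iteration to the next.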

How to speed up a high dimensional loop in python with numpy instead of pandas?

This loop takes 5 hours to run. How can I speed it up? I read something about using NumPy functions instead of pandas. I tried, as you can see, but I am too new to Python to do it right. The main difficulty is the high-dimensional data with 6000 columns. All the data is static except for the random weights. How do I write better code?
import numpy as np
import pandas as pd
import os
#Covariance matrix in pandas DataFrame 6000 columns x 6000 rows
cov = input_table_1.copy()
#Mean returns pandas DataFrame 6000 columns x 1800 rows
mean_returns = input_table_2.copy().squeeze()
#Looping number (must be an integer for range and np.zeros)
num_portfolios = 100000
#Empty results matrix
results_matrix = np.zeros((len(cov.columns)+1, num_portfolios))
rf=0
#Loop corpus
for i in range(num_portfolios):
    #Random numbers between 0 and 1 for every column
    weights = np.random.uniform(0,1,len(cov.columns))
    #Ensure sum of all random numbers is = 1
    weights /= np.sum(weights)
    #Some easy math operations
    portfolio_return = np.sum(mean_returns * weights) * 252
    portfolio_std = np.sqrt(np.dot(weights.T, np.dot(cov, weights))) * np.sqrt(252)
    sharpe_ratio = (portfolio_return - rf) / portfolio_std
    #write sharpe_ratio in result matrix as result for every loop
    results_matrix[0,i] = sharpe_ratio
    #iterate through the weight vector and add data to results array
    for j in range(len(weights)):
        results_matrix[j+1,i] = weights[j]
#output table as pandas data frame
output_table = pd.DataFrame(results_matrix.T,columns=['sharpe'] + [ticker for ticker in list(cov.columns)] )

There is no generic way to do that. First you must identify where your code is slow; after that you can apply optimizations.
You have a nested loop, so the complexity is O(n^2), but that is not a big deal here because a lot of the work can be done with a vectorized approach.
In Python, creating new objects is slow, so, for example, if the result fits in RAM, the first np.random.uniform call can be made once up front and its values consumed inside the loop.
The nested loop can be vectorized; it looks like the best candidate for a performance gain.
In any case, I suggest using a tool like perf_tool, which will point you to exactly the slow piece of code [*]
[*] I'm the main developer of this tool.

@AmilaMGunawardana Here is my first try with TensorFlow, but it is not fast enough. In the end I waited 5 hours for 100,000 rounds. Maybe I have to do something better?
perf_tool showed me that everything in the code is fast, except this part:
vol_arr[x] = tnp.sqrt(tnp.dot(multi_randoms[x].T, np.dot(covData*252, multi_randoms[x]))) --> This part takes 90% of the execution time.
covData = input_table_1.copy()
#Mean returns Pandas DataFrame 6000 columns x 1800 rows
mean_returns = input_table_2.copy().squeeze()
#Looping number
num_ports = 100000
rf=0
#print("mean_returns: ", mean_returns)
#print("cov2: ", cov2)
#print("cov: ", cov)
all_weights = np.zeros((num_ports, len(mean_returns.columns))) #tnp.zeros([num_ports,len(returns.columns)], dtype=tnp.float32) #np.zeros((num_ports, len(returns.columns)))
ret_arr = pd.to_numeric(np.zeros(num_ports))#tnp.zeros(num_ports, dtype=tnp.float32)# pd.to_numeric(np.zeros(num_ports))
vol_arr = pd.to_numeric(np.zeros(num_ports))#tnp.zeros(num_ports, dtype=tnp.float32)
sharpe_arr = pd.to_numeric(np.zeros(num_ports))#tnp.zeros(num_ports, dtype=tnp.float32)
multi_randoms = np.random.normal(0, 1., size=(num_ports,len(covData.columns) ))
#perf_tool('main')
def main():
    for x in range(num_ports):
        with PerfTool('preparation1'):
            # Save weights
            all_weights[x,:] = multi_randoms[x]
        with PerfTool('preparation2'):
            # Expected return
            ret_arr[x] = tnp.sum( (mean_returns * multi_randoms[x] * 252))
        with PerfTool('preparation3'):
            # Expected volatility
            vol_arr[x] = tnp.sqrt(tnp.dot(multi_randoms[x].T, np.dot(covData*252, multi_randoms[x])))
        with PerfTool('preparation4'):
            # Sharpe Ratio (parentheses so that rf is subtracted before dividing)
            sharpe_arr[x] = (ret_arr[x] - rf) / vol_arr[x]
PerfTool.set_enabled()
main()
PerfTool.show_stats_if_enabled()

This shows one way of improving things with parallel execution. How could I get rid of the loop? Is there a way to do these calculations in a single step, using the all_weights array once instead of looping over it?
import pandas as pd
import numpy as np
from perf_tool import PerfTool, perf_tool
from joblib import Parallel, delayed, parallel_backend
#Covarinace Matrix in Pandas Dataframe 6000 columns x 6000 rows
covData = input_table_1.copy()
#Mean returns Pandas DataFrame 6000 columns x 1800 rows
mean_returns = input_table_2.copy().squeeze()
#Looping number
num_ports = 100000
all_weights = np.zeros((num_ports, len(mean_returns.columns)))
#multi_randoms = np.random.random(size=(len(df.columns) ))
for x in range(num_ports):
    weights = np.array(np.random.random(len(mean_returns.columns)))
    weights = weights/np.sum(weights)
    all_weights[x,:] = weights
    #print(weights)
#weights = np.array(np.random.random(len(returns.columns)))
#print(all_weights)
#print("cov2 type: ", type(cov2))
#cov = pd.DataFrame(np.random.normal(0, 1., size=(600,600)))
#print("cov type: ", type(cov))
rf=0
#print("mean_returns: ", mean_returns)
#print("cov2: ", cov2)
#print("cov: ", cov)
#all_weights = np.zeros((num_ports, len(returns.columns)))
ret_arr = pd.to_numeric(np.zeros(num_ports))
vol_arr = pd.to_numeric(np.zeros(num_ports))
sharpe_arr = pd.to_numeric(np.zeros(num_ports))
##perf_tool('main')
##jit(parallel=True)
def test(x):
    #for x in range(num_ports):
    #with PerfTool('preparation1'):
    # Weights
    #weights = np.array(np.random.random(len(returns.columns)))
    #with PerfTool('preparation2'):
    #weights = weights/np.sum(weights)
    #with PerfTool('preparation3'):
    # Save weights
    weights= all_weights[x]
    #with PerfTool('preparation4'):
    # Expected return
    ret_arr[x] = np.sum( (mean_returns * weights * 252))
    #with PerfTool('preparation5'):
    # Expected volatility
    vol_arr[x] = np.sqrt(np.dot(weights.T, np.dot(covData*252, weights)))
    #with PerfTool('preparation6'):
    # Sharpe Ratio (parentheses so that rf is subtracted before dividing)
    return x, (ret_arr[x] - rf) / vol_arr[x]
#sharpe_arr[x] = (np.sum( (mean_returns * all_weights * 252)) - rf) /(np.sqrt(np.dot(all_weights.T, np.dot(covData*252, all_weights))))
#PerfTool.set_enabled()
sharpe= []
weighttable= []
weighttable, sharpe= zip(*Parallel(n_jobs=-1)([delayed(test)(i) for i in range(num_ports)]))
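On the question of dropping the loop entirely: one possible approach, sketched here with small random stand-in inputs instead of input_table_1 and input_table_2, is to hold all weight vectors in one 2-D array and evaluate the return and the quadratic form w^T C w for every row at once. The sizes, the synthetic covariance matrix and all variable names below are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_assets, n_ports = 50, 1000  # small stand-ins for 6000 assets and 100000 portfolios

# Hypothetical inputs: a positive semi-definite covariance matrix and mean daily returns
a = rng.normal(size=(n_assets, n_assets))
cov = a @ a.T / n_assets
mean_returns = rng.normal(0.0005, 0.001, size=n_assets)
rf = 0.0

# All weight vectors at once, each row normalized to sum to 1
weights = rng.uniform(0, 1, size=(n_ports, n_assets))
weights /= weights.sum(axis=1, keepdims=True)

# Annualized return per portfolio: one matrix-vector product
port_return = weights @ mean_returns * 252
# Row-wise quadratic form w^T C w, without building an n_ports x n_ports matrix
port_var = np.einsum("ij,ij->i", weights @ cov, weights)
port_std = np.sqrt(port_var * 252)
sharpe = (port_return - rf) / port_std
```

At the full size from the question, a 100000 x 6000 float64 array takes about 4.8 GB, and weights @ cov needs the same again, so in practice it may be necessary to process the portfolios in batches of a few thousand rows.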

How to resolve Boolean value error in linear regression model in python?

I am trying to run a Fama-MacBeth regression in Python. As a first step I am running the time series regression for every asset in my portfolio, but I am unable to run it because I am getting an error:
'ValueError: Must pass DataFrame with boolean values only'
I am relatively new to Python and have relied heavily on this forum to help me out. I hope you can help me with this issue.
Please let me know how I can resolve this. I will be very grateful!
I assume this line is producing the error, because when I run the function without the for loop, it works perfectly.
for i in range(cols):
    df_beta = RegressionRoll(df=data_set, subset = 0, dependent = data_set.iloc[:,i], independent = data_set.iloc[:,30:], const = True, parameters = 'beta',
                             win = 12)
The dimension of my matrix is 108x35, 30 stocks and 5 factors over 108 points. Hence I want to run a regression for every stock against the 4 factors and store the result of the coeffs in a dataframe. Sample dataframe:
Date        BAS GY    AI FP     SGL GY    LNA GY    AKZA NA   Market Factor
1/29/2010   -5.28%    -7.55%    -1.23%    -5.82%    -7.09%    -5.82%
2/26/2010   0.04%     13.04%    -1.84%    4.06%     -14.62%   -14.62%
3/31/2010   10.75%    1.32%     7.33%     6.61%     12.21%    12.21%
The following is the entire code:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
data_set = pd.read_excel(r'C:\XXX\Research Project\Data\Regression.xlsx', sheet_name = 'Fama Macbeth')
data_set.set_index(data_set['Date'], inplace=True)
data_set.drop('Date', axis=1, inplace=True)
X = data_set.iloc[:,30:]
y = data_set.iloc[:,:30]
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
    # Data subset
    if subset != 0:
        df = df.tail(subset)
    else:
        df = df
    # Loopinfo
    end = df.shape[0]
    win = win
    rng = np.arange(start = win, stop = end, step = 1)
    # Subset and store dataframes
    frames = {}
    n = 1
    for i in rng:
        df_temp = df.iloc[:i].tail(win)
        newname = 'df' + str(n)
        frames.update({newname: df_temp})
        n += 1
    # Analysis on subsets
    df_results = pd.DataFrame()
    for frame in frames:
        #print(frames[frame])
        # Rolling data frames
        dfr = frames[frame]
        y = dependent
        x = independent
        if const == True:
            x = sm.add_constant(dfr[x])
            model = sm.OLS(dfr[y], x).fit()
        else:
            model = sm.OLS(dfr[y], dfr[x]).fit()
        if parameters == 'beta':
            theParams = model.params[0:]
            coefs = theParams.to_frame()
            df_temp = pd.DataFrame(coefs.T)
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_results = pd.concat([df_results, df_temp], axis = 0)
        if parameters == 'R2':
            theParams = model.rsquared
            df_temp = pd.DataFrame([theParams])
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_temp.columns = [', '.join(independent)]
            df_results = pd.concat([df_results, df_temp], axis = 0)
    return(df_results)
cols = len(y.columns)
for i in range(cols):
    df_beta = RegressionRoll(df=data_set, subset = 0, dependent = data_set.iloc[:,i], independent = data_set.iloc[:,30:], const = True, parameters = 'beta',
                             win = 12)
ValueError: Must pass DataFrame with boolean values only
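A likely cause, offered as a reading of the code rather than a confirmed diagnosis: inside RegressionRoll the arguments are used as keys, dfr[x] and dfr[y], and ', '.join(independent) suggests the function expects column names. The call passes data_set.iloc[:,30:], which is a DataFrame, and pandas treats a DataFrame key as a boolean mask. The sketch below reproduces that behaviour on a small made-up frame (the column names and sizes are hypothetical); the exact exception type and message depend on the pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for data_set: 3 "stock" columns followed by 2 "factor" columns
data = pd.DataFrame(rng.normal(size=(24, 5)), columns=["s1", "s2", "s3", "f1", "f2"])
window = data.tail(12)

# A DataFrame used as a key is interpreted as a boolean mask, so float values are rejected
try:
    window[data.iloc[:, 3:]]
    raised = False
except (ValueError, TypeError) as exc:
    raised = True
    print(type(exc).__name__, exc)

# Column labels select the intended columns from each rolling window
dependent = data.columns[0]
independent = list(data.columns[3:])
y = window[dependent]
x = window[independent]
print(x.shape)  # (12, 2)
```

If that is the problem, calling the function with dependent = data_set.columns[i] and independent = list(data_set.columns[30:]) would be the corresponding change in the loop.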

How does one append rows to a dataframe from a loop using Pandas?

I'm running a loop that appends values to an empty dataframe defined outside of the loop. However, when the loop finishes, the dataframe remains empty. I'm not sure what's going on. The goal is to find the power value that results in the lowest sum of squared residuals.
Example code below:
import numpy as np
import pandas as pd
import tweedie
power_list = np.arange(1.3, 2, .01)
mean = 353.77
std = 17298.24
size = 860310
# x is not defined yet at this point, so draw size values instead of len(x)
x = tweedie.tweedie(mu = mean, p = 1.5, phi = 50).rvs(size)
variance = 299228898.89
sum_ssr_df = pd.DataFrame(columns = ['power', 'dispersion', 'ssr'])
for i in power_list:
    power = i
    phi = variance/(mean**power)
    tvs = tweedie.tweedie(mu = mean, p = power, phi = phi).rvs(len(x))
    sort_tvs = np.sort(tvs)
    df = pd.DataFrame([x, sort_tvs]).transpose()
    df.columns = ['actual', 'random']
    df['residual'] = df['actual'] - df['random']
    ssr = df['residual']**2
    sum_ssr = np.sum(ssr)
    df_i = pd.DataFrame([i, phi, sum_ssr])
    df_i = df_i.transpose()
    df_i.columns = ['power', 'dispersion', 'ssr']
    sum_ssr_df.append(df_i)
sum_ssr_df[sum_ssr_df['ssr'] == sum_ssr_df['ssr'].min()]
What exactly am I doing incorrectly?

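This code isn't as efficient as it could be, as noted by ALollz. When you append, it basically creates a new dataframe in memory (I'm oversimplifying here).
The error in your code is that append returns a new dataframe instead of modifying sum_ssr_df in place, so the result has to be assigned back: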
sum_ssr_df.append(df_i)
should be:
sum_ssr_df = sum_ssr_df.append(df_i)
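Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions this line raises an AttributeError. A pattern that works across versions, and also avoids copying the frame on every iteration, is to collect the rows in a list and build the dataframe once after the loop. The sketch below uses a gamma draw as a placeholder for the tweedie call so that it runs without that package; the data and the ssr values are purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for the observed data in the question
actual = np.sort(rng.gamma(2.0, 100.0, size=1000))

rows = []
for power in np.arange(1.3, 2, 0.01):
    # Placeholder for the Tweedie draw; any simulated sample fits here
    simulated = np.sort(rng.gamma(power, 100.0, size=actual.size))
    ssr = np.sum((actual - simulated) ** 2)
    rows.append({"power": power, "ssr": ssr})

# Build the frame once, after the loop
sum_ssr_df = pd.DataFrame(rows)
best = sum_ssr_df.loc[sum_ssr_df["ssr"].idxmin()]
print(best)
```

pd.concat on a list of small dataframes would work as well; the point is to combine everything in a single step rather than once per iteration.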
