Improve performance calculating a random sample matching specific conditions in pandas

For some dataset group_1 I need to iterate over all rows k times for robustness and find a matching random sample of another data frame group_2 according to some criteria expressed as data frame columns.
Unfortunately, this is fairly slow.
How can I improve performance?
The bottleneck is the apply-ed function, i.e. randomMatchingCondition.
import numpy as np
import pandas as pd
from tqdm import tqdm
tqdm.pandas()  # registers DataFrame.progress_apply
seed = 47
np.random.seed(seed)
###################################################################
# generate dummy data
size = 10000
df = pd.DataFrame({i: np.random.randint(1,100,size=size) for i in ['metric']})
df['label'] = np.random.randint(0,2, size=size)
df['group_1'] = pd.Series(np.random.randint(1,12, size=size)).astype(object)
df['group_2'] = pd.Series(np.random.randint(1,10, size=size)).astype(object)
group_0 = df[df['label'] == 0]
group_0 = group_0.reset_index(drop=True)
group_0 = group_0.rename(index=str, columns={"metric": "metric_group_0"})
join_columns_enrich = ['group_1', 'group_2']
join_real = ['metric_group_0']
join_real.extend(join_columns_enrich)
group_0 = group_0[join_real]
display(group_0.head())
group_1 = df[df['label'] == 1]
group_1 = group_1.reset_index(drop=True)
display(group_1.head())
###################################################################
# naive find random element matching condition
def randomMatchingCondition(original_element, group_0, join_columns, random_state):
    # build a query such as "group_1 == 3 & group_2 == 7" from the current row
    limits_dict = original_element[join_columns].to_dict()
    query = ' & '.join([f"{k} == {v}" for k, v in limits_dict.items()])
    candidates = group_0.query(query)
    if len(candidates) > 0:
        return candidates.sample(n=1, random_state=random_state)['metric_group_0'].values[0]
    else:
        return np.nan
###################################################################
# iterate over pandas dataframe k times for more robust sampling
k = 3
resulting_df = None
for i in range(1, k+1):
    group_1['metric_group_0'] = group_1.progress_apply(randomMatchingCondition,
                                                       args=[group_0, join_columns_enrich, None],
                                                       axis = 1)
    group_1['run'] = i
    if resulting_df is None:
        resulting_df = group_1.copy()
    else:
        resulting_df = pd.concat([resulting_df, group_1])
resulting_df.head()
Experimenting with pre-sorting the data:
group_0 = group_0.sort_values(join_columns_enrich)
group_1 = group_1.sort_values(join_columns_enrich)
does not show any difference.

IIUC you want to end up with k random samples for each row (combination of metrics) in your input dataframe. So why not use candidates.sample(n=k, ...) and get rid of the for loop? Alternatively, you could concatenate your dataframe k times with pd.concat([group_1] * k).
It depends on your real data, but I would try grouping the input dataframe by the join columns with group_1.groupby(join_columns_enrich) (if their cardinality is sufficiently low) and applying the random sampling to these groups, picking k * len(group.index) random samples for each. groupby is expensive; OTOH you might save a lot on the iteration/sampling once it's done.
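The groupby idea above could look roughly like the following. This is only a sketch under assumptions: the helper name sample_matches, its parameters and the tiny demo frames are made up for illustration, rows of group_1 without any candidate are dropped rather than filled with NaN, and sampling is done with replacement.

```python
import numpy as np
import pandas as pd

def sample_matches(group_1, group_0, keys, value_col, k, seed=None):
    """For each row of group_1, draw k random values of value_col from the
    rows of group_0 that share the same key combination (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # One candidate pool per key combination, built once instead of once per row.
    pools = {key: g[value_col].to_numpy() for key, g in group_0.groupby(keys)}
    parts = []
    for key, g in group_1.groupby(keys):
        pool = pools.get(key)
        if pool is None:
            continue  # no candidates for this combination: rows are dropped
        # One row of k draws (with replacement) per row of this group.
        draws = rng.choice(pool, size=(len(g), k), replace=True)
        out = g.iloc[np.repeat(np.arange(len(g)), k)].copy()
        out[value_col] = draws.ravel()
        out['run'] = np.tile(np.arange(1, k + 1), len(g))
        parts.append(out)
    return pd.concat(parts) if parts else group_1.iloc[0:0]

# Tiny demo data (made up): only key combinations (1, 1) and (2, 1) have candidates.
g0 = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 1, 1], 'v': [10, 11, 20]})
g1 = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 1, 1], 'm': [5, 6, 7]})
res = sample_matches(g1, g0, ['a', 'b'], 'v', k=3, seed=0)
```

With the question's data the call would be something like sample_matches(group_1, group_0, join_columns_enrich, 'metric_group_0', k). The Python-level loop then runs once per key combination instead of once per row.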

@smiandras, you are correct. Getting rid of the for loop is important.
Variant 1: multiple samples:
def randomMatchingCondition(original_element, group_0, join_columns, k, random_state):
    limits_dict = original_element[join_columns].to_dict()
    # col/val instead of k/v so the sample size k is not visually shadowed
    query = ' & '.join([f"{col} == {val}" for col, val in limits_dict.items()])
    candidates = group_0.query(query)
    if len(candidates) > 0:
        # replace=True so that groups with fewer than k candidates still work
        return candidates.sample(n=k, random_state=random_state, replace=True)['metric_group_0'].values
    else:
        return np.nan
###################################################################
# iterate over pandas dataframe k times for more robust sampling
k = 3
resulting_df = None
#######################
# trying to improve performance: sort both dataframes
group_0 = group_0.sort_values(join_columns_enrich)
group_1 = group_1.sort_values(join_columns_enrich)
#######################
group_1['metric_group_0'] = group_1.progress_apply(randomMatchingCondition,
args=[group_0, join_columns_enrich, k, None],
axis = 1)
print(group_1.isnull().sum())
group_1 = group_1[~group_1.metric_group_0.isnull()]
display(group_1.head())
s=pd.DataFrame({'metric_group_0':np.concatenate(group_1.metric_group_0.values)},index=group_1.index.repeat(group_1.metric_group_0.str.len()))
s = s.join(group_1.drop('metric_group_0', axis=1),how='left')
s['pos_in_array'] = s.groupby(s.index).cumcount()
s.head()
Variant 2: all possible samples optimized by native JOIN operation.
WARN this is a bit unsafe as it might generate a gigantic number of rows:
size = 1000
df = pd.DataFrame({i: np.random.randint(1,100,size=size) for i in ['metric']})
df['label'] = np.random.randint(0,2, size=size)
df['group_1'] = pd.Series(np.random.randint(1,12, size=size)).astype(object)
df['group_2'] = pd.Series(np.random.randint(1,10, size=size)).astype(object)
group_0 = df[df['label'] == 0]
group_0 = group_0.reset_index(drop=True)
join_columns_enrich = ['group_1', 'group_2']
join_real = ['metric']
join_real.extend(join_columns_enrich)
group_0 = group_0[join_real]
display(group_0.head())
group_1 = df[df['label'] == 1]
group_1 = group_1.reset_index(drop=True)
display(group_1.head())
df = group_1.merge(group_0, on=join_columns_enrich)
display(df.head())
print(group_1.shape)
df.shape
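If the merged frame fits in memory, the k samples per original row can be drawn from it afterwards. A minimal sketch, assuming pandas >= 1.1 (for DataFrameGroupBy.sample); the helper name sample_after_merge, the row_id column and the demo frames are made up for illustration:

```python
import numpy as np
import pandas as pd

def sample_after_merge(group_1, group_0, keys, k, seed=None):
    """Join all candidates, then keep k random candidates per group_1 row."""
    left = group_1.copy()
    left['row_id'] = np.arange(len(left))  # identifies the original row
    # Overlapping non-key columns of group_0 get the suffix '_group_0'.
    merged = left.merge(group_0, on=keys, suffixes=('', '_group_0'))
    return merged.groupby('row_id').sample(n=k, replace=True, random_state=seed)

# Tiny demo data (made up): the row with a == 3 has no match and disappears.
g0 = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 1, 1], 'metric': [10, 11, 20]})
g1 = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 1, 1], 'metric': [5, 6, 7]})
sampled = sample_after_merge(g1, g0, ['a', 'b'], k=3, seed=0)
```

The warning above still applies: the intermediate merge holds every matching pair, so this only pays off when the key combinations are not too large.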

Related

Pandas iteration over rows for features calculation

I have a pandas data frame and I want to calculate some features based on some short_window, long_window and bins values. More specifically, for each different row, I want to calculate some features. In order to do so, I move one row forward the df_long = df.loc[row:long_window+row] such as in the first iteration the pandas data frame for row=0 would be df_long = df.loc[0:50+0] and some features would be calculated based on this data frame, for row=1 would be df_long = df.loc[1:50+1] and some other features would be calculated and continues.
from numpy.random import seed
from numpy.random import randint
import pandas as pd
from joblib import Parallel, delayed
bins = 12
short_window = 10
long_window = 50
# seed random number generator
seed(1)
price = pd.DataFrame({
    'DATE_TIME': pd.date_range('2012-01-01', '2012-02-01', freq='30min'),
    'value': randint(2, 20, 1489),
    'amount': randint(50, 200, 1489)
})
def vap(row, df, short_window, long_window, bins):
    df_long = df.loc[row:long_window+row]
    df_short = df_long.tail(short_window)
    binning = pd.cut(df_long['value'], bins, retbins=True)[1]
    group_months = pd.DataFrame(df_short['amount'].groupby(pd.cut(df_short['value'], binning)).sum())
    return group_months['amount'].tolist(), df.loc[long_window + row + 1, 'DATE_TIME']
def feature_extraction(data, short_window, long_window, bins):
    # Vap feature extraction
    ls = [f"feature{row + 1}" for row in range(bins)]
    amount, date = zip(*Parallel(n_jobs=4)(delayed(vap)(i, data, short_window, long_window, bins)
                                           for i in range(0, data.shape[0] - long_window - 1)))
    temp = pd.DataFrame(date, columns=['DATE_TIME'])
    temp[ls] = pd.DataFrame(amount, index=temp.index)
    data = data.merge(temp, on='DATE_TIME', how='outer')
    return data
df = feature_extraction(price, short_window, long_window, bins)
I tried to run it in parallel in order to save time, but due to the dimensions of my data it takes a long time to finish.
Is there any way to change this iterative process (df_long = df.loc[row:long_window+row]) in order to reduce the computational cost? I was wondering if there is any way to use pandas.rolling but I am not sure how to use it in this case.
Any help would be much appreciated!
Thank you

This is the first try to speed up the calculation. I checked the first 100 rows and found out that the binning variable was always the same. So I managed to do an efficient algorithm with fixed bins. But when I checked the function on the whole data, I found out that there are about 100 lines out of 1489, that had a different binning variable so the solution below deviates in 100 lines from the original answer.
Benchmarking:
My fast function: 28 ms
My precise function: 388 ms
Original function: 12200 ms
So a speed-up of roughly 435 times for the fast function and roughly 31 times for the precise function.
Fast function code:
import numpy as np
def feature_extraction2(data, short_window, long_window, bins):
    ls = [f"feature{row + 1}" for row in range(bins)]
    # fixed bin edges over the value range [2, 19] instead of per-window edges
    binning = pd.cut([2,19], bins, retbins=True)[1]
    bin_group = np.digitize(data['value'], binning, right=True)
    l_sum = []
    for i in range(1, bins+1):
        # rolling sum of 'amount' over the rows that fall into bin i
        sum1 = ((bin_group == i)*data['amount']).rolling(short_window).sum()
        l_sum.append(sum1)
    ar_sum = np.array(l_sum).T
    # shift by one row so the features line up with the date of the next row
    ar_shifted = np.empty_like(ar_sum)
    ar_shifted[:long_window+1,:] = np.nan
    ar_shifted[long_window+1:,:] = ar_sum[long_window:-1,:]
    temp = pd.DataFrame(ar_shifted, columns = ls)
    data = pd.concat([data,temp], axis = 1, sort = False)
    return data
Precise function:
data = price.copy()
# Vap feature extraction
ls = [f"feature{row + 1}" for row in range(bins)]
norm_volume = []
date = []
for i in range(0, data.shape[0] - long_window - 1):
    row = i
    df = data
    df_long = df.loc[row:long_window+row]
    df_short = df_long.tail(short_window)
    binning = pd.cut(df_long['value'], bins, retbins=True)[1]
    group_months = df_short['amount'].groupby(pd.cut(df_short['value'], binning)).sum().values
    x,y = group_months, df.loc[long_window + row + 1, 'DATE_TIME']
    norm_volume.append(x)
    date.append(y)
temp = pd.DataFrame(date, columns=['DATE_TIME'])
temp[ls] = pd.DataFrame(norm_volume, index=temp.index)
data = data.merge(temp, on='DATE_TIME', how='outer')

How to resolve Boolean value error in linear regression model in python?

I am trying to run a Fama-MacBeth regression in Python. As a first step I am running the time series for every asset in my portfolio, but I am unable to run it because I am getting an error:
'ValueError: Must pass DataFrame with boolean values only'
I am relatively new to Python and have heavily relied on this forum to help me out. I hope you can help me with this issue.
Please let me know how I can resolve this. I will be very grateful to you!
I assume this line is producing the error. Cause when I run the function without the for loop, it works perfectly.
for i in range(cols):
    df_beta = RegressionRoll(df=data_set, subset = 0, dependent = data_set.iloc[:,i], independent = data_set.iloc[:,30:], const = True, parameters = 'beta',
                             win = 12)
The dimension of my matrix is 108x35, 30 stocks and 5 factors over 108 points. Hence I want to run a regression for every stock against the 4 factors and store the result of the coeffs in a dataframe. Sample dataframe:
Date        BAS GY   AI FP    SGL GY   LNA GY   AKZA NA   Market Factor
1/29/2010   -5.28%   -7.55%   -1.23%   -5.82%   -7.09%    -5.82%
2/26/2010    0.04%   13.04%   -1.84%    4.06%   -14.62%   -14.62%
3/31/2010   10.75%    1.32%    7.33%    6.61%   12.21%    12.21%
The following is the entire code:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
data_set = pd.read_excel(r'C:\XXX\Research Project\Data\Regression.xlsx', sheet_name = 'Fama Macbeth')
data_set.set_index(data_set['Date'], inplace=True)
data_set.drop('Date', axis=1, inplace=True)
X = data_set.iloc[:,30:]
y = data_set.iloc[:,:30]
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
    # Data subset
    if subset != 0:
        df = df.tail(subset)
    else:
        df = df
    # Loopinfo
    end = df.shape[0]
    win = win
    rng = np.arange(start = win, stop = end, step = 1)
    # Subset and store dataframes
    frames = {}
    n = 1
    for i in rng:
        df_temp = df.iloc[:i].tail(win)
        newname = 'df' + str(n)
        frames.update({newname: df_temp})
        n += 1
    # Analysis on subsets
    df_results = pd.DataFrame()
    for frame in frames:
        #print(frames[frame])
        # Rolling data frames
        dfr = frames[frame]
        y = dependent
        x = independent
        if const == True:
            x = sm.add_constant(dfr[x])
            model = sm.OLS(dfr[y], x).fit()
        else:
            model = sm.OLS(dfr[y], dfr[x]).fit()
        if parameters == 'beta':
            theParams = model.params[0:]
            coefs = theParams.to_frame()
            df_temp = pd.DataFrame(coefs.T)
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_results = pd.concat([df_results, df_temp], axis = 0)
        if parameters == 'R2':
            theParams = model.rsquared
            df_temp = pd.DataFrame([theParams])
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_temp.columns = [', '.join(independent)]
            df_results = pd.concat([df_results, df_temp], axis = 0)
    return(df_results)
cols = len(y.columns)
for i in range(cols):
    df_beta = RegressionRoll(df=data_set, subset = 0, dependent = data_set.iloc[:,i], independent = data_set.iloc[:,30:], const = True, parameters = 'beta',
                             win = 12)
ValueError: Must pass DataFrame with boolean values only
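One plausible reading of the error, offered as an assumption rather than a confirmed diagnosis: RegressionRoll selects columns with dfr[y] and dfr[x], so it expects column labels, while the loop passes a Series and a DataFrame (data_set.iloc[:,i] and data_set.iloc[:,30:]). Indexing a DataFrame with another DataFrame is treated as a boolean mask, which would match the message. The sketch below shows label-based selection in a rolling regression. It deliberately uses numpy.linalg.lstsq instead of statsmodels to stay self-contained, and the function name rolling_betas, the column names and the synthetic data are made up for illustration.

```python
import numpy as np
import pandas as pd

def rolling_betas(df, dependent, independent, win):
    """Rolling OLS coefficients of column `dependent` on columns `independent`.
    Both are column labels (a string and a list of strings), not data."""
    coefs, stamps = [], []
    for end in range(win, len(df) + 1):
        window = df.iloc[end - win:end]
        # Design matrix with a leading column of ones for the intercept.
        X = np.column_stack([np.ones(win),
                             window[independent].to_numpy(dtype=float)])
        y = window[dependent].to_numpy(dtype=float)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        coefs.append(beta)
        stamps.append(window.index[-1])  # label the fit by the window's last date
    return pd.DataFrame(coefs, index=stamps, columns=['const'] + list(independent))

# Synthetic, noise-free demo: stock = 1 + 2*f1 - 3*f2
rng = np.random.default_rng(0)
demo = pd.DataFrame({'f1': rng.normal(size=30), 'f2': rng.normal(size=30)})
demo['stock'] = 1 + 2 * demo['f1'] - 3 * demo['f2']
betas = rolling_betas(demo, 'stock', ['f1', 'f2'], win=12)
```

Applied to the question's loop, the analogous change would be to pass data_set.columns[i] and list(data_set.columns[30:]) instead of the iloc slices.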

Python add 2 multidimensional numpy arrays

I'm trying to collect/concat multiple numpy arrays in a single numpy array. I can do this with pandas data frame as:
df_train = pd.DataFrame()
... loop ...:
df_temp = pd.read_json(file)
df_train = pd.concat([df_train, df_temp], ignore_index=True, axis=0, sort=False)
in a loop. With this I'm able to combine various data in a single data frame.
What I want to do this is with numpy arrays. I tried the same thing as:
nump_train = np.nan
... loop ...:
nump = df_temp.values
nump_train = np.concatenate((nump_train, nump))
but I cannot concat zero-dimensional arrays as the error message says (ValueError: zero-dimensional arrays cannot be concatenated)
How can I do this like in pandas?
ps: I can solve this with a bad-written hard-coded code as:
w=1
for loop:
    if w == 1:
        nump1 = sc.transform(df_temp.drop(['time'], axis=1))
    elif w == 2:
        nump2 = sc.transform(df_temp.drop(['time','trend'], axis=1))
    elif w == 3:
        nump3 = sc.transform(df_temp.drop(['time'], axis=1))
    w += 1
X_train = np.concatenate((nump1, nump2, nump3), axis = 0)
But this is bad coding and I cannot scale it in a loop.
EDIT 1:
Actual code is this:
w = 1
for i in range(1, loop_size+1):
    df_train = pd.DataFrame()
    nump_train = np.nan
    random_list = random.sample(file_list, selection)
    for json in random_list:
        json_name = json[:json.index('_')]
        df_temp = pd.read_json(filedir + json)
        train_period_mask = (df_temp['time'] > train_start_date) & (df_temp['time'] < train_end_date)
        df_temp = df_temp.loc[train_period_mask]
        df_temp.index = np.arange(0, len(df_temp))
        df_temp = calc_(df_temp)
        df_temp['trend'] = zg(df_temp, zg_ratio)
        df_temp['trend_shifted'] = df_temp.trend.shift(-1)
        df_temp = df_temp.dropna()
        nump = sc.fit_transform(df_temp.drop(['time','trend_shifted','trend'], axis=1))
        if w == 1:
            nump1 = sc.transform(df_temp.drop(['time','trend_shifted','trend'], axis=1))
        elif w == 2:
            nump2 = sc.transform(df_temp.drop(['time','trend_shifted','trend'], axis=1))
        elif w == 3:
            nump3 = sc.transform(df_temp.drop(['time_period_start','trend_shifted','trend'], axis=1))
        df_train = pd.concat([df_train, df_temp], ignore_index=True, axis=0, sort=False)
        nump_train.append(nump)
        w += 1
    drop_list = ['time_period_start']
    df_train.drop(drop_list, 1, inplace = True )
    start = timeit.default_timer()
    sc = MinMaxScaler()
    X_train = sc.fit_transform(df_train.drop(['trend','trend_shifted'], axis=1))
    X_train2 = np.concatenate((nump1, nump2, nump3), axis = 0)
    y_train = df_train['trend_shifted'].values
I want X_train and X_train2 to have the same shape.
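A common pattern for this, sketched here with made-up stand-in arrays: start from an empty Python list rather than np.nan, append one array per iteration, and call np.concatenate once after the loop. The helper name stack_chunks is hypothetical, and all arrays are assumed to have the same number of columns.

```python
import numpy as np

def stack_chunks(chunks):
    """Concatenate a list of 2-D arrays with equal column counts along axis 0."""
    if not chunks:
        return np.empty((0, 0))  # nothing collected
    return np.concatenate(chunks, axis=0)

chunks = []                      # a list, so .append() works
for n_rows in (2, 3, 1):         # stands in for the loop over files
    nump = np.full((n_rows, 4), float(n_rows))  # stands in for sc.transform(...)
    chunks.append(nump)
X_train2 = stack_chunks(chunks)
```

Concatenating once at the end also avoids copying the growing array on every iteration.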

Unique column of a sparse matrix in Python

What is a good way of identifying the unique columns of a sparse matrix represented in csc_matrix format and the times each column is repeated?
I have no a priori information about the elements of the matrix. It is the result of sampling the columns of another matrix with replacement, so I can have duplicated columns either because a column is sampled many times or because there are duplicated columns in the original matrix. Therefore I cannot apply numpy.unique to the indices of the sampled columns, and I think it is not a good choice to convert the entire matrix to a dense format and then apply numpy.unique to it.

You can sort and group the columns by their number of nonzeros. Then sort each group by indices and values and split it into blocks of no change:
import numpy as np
from scipy import sparse
def sparse_unique_columns(M):
    M = M.tocsc()
    m, n = M.shape
    if not M.has_sorted_indices:
        M.sort_indices()
    if not M.has_canonical_format:
        M.sum_duplicates()
    sizes = np.diff(M.indptr)
    idx = np.argsort(sizes)
    # permute the columns so that they are ordered by number of nonzeros
    Ms = M@sparse.csc_matrix((np.ones((n,)), idx, np.arange(n+1)), (n, n))
    ssizes = np.diff(Ms.indptr)
    ssizes[1:] -= ssizes[:-1]
    grpidx, = np.where(ssizes)
    grpidx = np.concatenate([grpidx, [n]])
    if ssizes[0] == 0:
        counts = [np.array([0, grpidx[0]])]
    else:
        counts = [np.zeros((1,), int)]
    ssizes = ssizes[grpidx[:-1]].cumsum()
    for i, ss in enumerate(ssizes):
        gil, gir = grpidx[i:i+2]
        pl, pr = Ms.indptr[[gil, gir]]
        # view each column's data and indices as one opaque record for sorting
        dv = Ms.data[pl:pr].view(f'V{ss*Ms.data.dtype.itemsize}')
        iv = Ms.indices[pl:pr].view(f'V{ss*Ms.indices.dtype.itemsize}')
        idxi = np.lexsort((dv, iv))
        dv = dv[idxi]
        iv = iv[idxi]
        chng, = np.where(np.concatenate(
            [[True], (dv[1:] != dv[:-1]) | (iv[1:] != iv[:-1]), [True]]))
        counts.append(np.diff(chng))
        idx[gil:gir] = idx[gil:gir][idxi]
    counts = np.concatenate(counts)
    nu = counts.size - 1
    uniques = M@sparse.csc_matrix((np.ones((nu,)), idx[counts[:-1].cumsum()],
                                   np.arange(nu + 1)), (n, nu))
    return uniques, idx, counts[1:]
a = np.random.uniform(0, 10, (1000, 200))
a[a>1] = 0
a = sparse.csc_matrix(a)
b = sparse.csc_matrix((np.ones(1000), np.random.randint(0, 200, (1000,)), np.arange(1001)))
c = a@b
unq, idx, cnt = sparse_unique_columns(c)
unqd, idxd, cntd = np.unique(c.toarray(), axis=1, return_counts=True, return_inverse=True)
from timeit import timeit
print('sparse:', timeit(lambda: sparse_unique_columns(c), number=1000), 'ms')
print('dense: ', timeit(lambda: np.unique(c.toarray(), axis=1, return_counts=True), number=100)*10, 'ms')
Sample output:
sparse: 2.735588440205902 ms
dense: 49.32689592242241 ms
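For comparison, a much simpler variant groups the columns by a dictionary key built from each column's raw index and data bytes. This is only a sketch and not the method timed above: it loops over columns in Python, so it is probably slower on large matrices, and it compares values bytewise (so 0.0 and -0.0, or differently encoded NaNs, count as different). The function name unique_columns_by_key and the demo matrix are made up for illustration.

```python
import numpy as np
from scipy import sparse

def unique_columns_by_key(M):
    """Return (first_index, counts) for the distinct columns of M."""
    M = sparse.csc_matrix(M, copy=True)  # work on a copy, never on the input
    M.sum_duplicates()    # canonical form: sorted row indices, no repeated entries
    M.eliminate_zeros()   # stored zeros must not distinguish equal columns
    first_index = {}      # column key -> index of its first occurrence
    counts = {}           # column key -> number of occurrences
    for j in range(M.shape[1]):
        lo, hi = M.indptr[j], M.indptr[j + 1]
        key = (M.indices[lo:hi].tobytes(), M.data[lo:hi].tobytes())
        if key not in first_index:
            first_index[key] = j
            counts[key] = 0
        counts[key] += 1
    keys = list(first_index)  # dicts keep insertion order
    return (np.array([first_index[k] for k in keys]),
            np.array([counts[k] for k in keys]))

# Demo: columns 0 and 2 are equal, column 3 is all zeros.
dense = np.array([[1, 0, 1, 0],
                  [0, 2, 0, 0],
                  [3, 0, 3, 0]])
first, cnt = unique_columns_by_key(dense)
```

The unique columns themselves can then be taken as M[:, first] from the original matrix.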

Rolling linear regression on large DataFrames

I have two huge dataframes df_y and df_x.
df_y has columns ['date','ids','Y']. Basically each 'ids' has data for all the 'date'.
df_x has columns ['date','X1','X2','X3','X4','X5','X6'].
df_x has all the dates that are in df_y. However, some ids might have a shorter period, i.e., either starting from a later date or ending at an earlier date.
I want to run a rolling linear regression (OLS) Y ~ X1 + X2 + X3 + X4 + X5 + X6 + intercept for each 'ids' in df_y with a lookback of 200 days.
Sample dataframes:
import string, random, pandas as pd, numpy as np
ids = [''.join(random.choice(string.ascii_uppercase) for _ in range(3)) for _ in range(200)]
dates = pd.date_range('2000-01-01', '2017-07-02')
df_dates = pd.DataFrame({'date':dates, 'joinC':len(dates)*[2]})
df_ids = pd.DataFrame({'ids':ids, 'joinC':len(ids)*[2]})
df_values = pd.DataFrame({'Y':np.random.normal(size =
len(dates)*len(ids))})
df_y = df_dates.merge(df_ids, on='joinC', how="outer")
df_y = df_y[['date', 'ids']].merge(df_values, left_index=True,
right_index=True, how="inner")
df_y = df_y.sort_values(['date', 'ids'], ascending=[True, True])
df_x = pd.DataFrame({'date':dates, 'X1':np.random.normal(size = len(dates)), 'X2':np.random.normal(size = len(dates)), 'X3':np.random.normal(size = len(dates)), 'X4':np.random.normal(size = len(dates)), 'X5':np.random.normal(size = len(dates)), 'X6':np.random.normal(size = len(dates))})
My attempt:
import statsmodels.api as sm
dates = list(df_y['date'].unique())
ids = list(df_y['ids'].unique())
for i in range(200, len(dates) +1):
    for id in ids:
        s_date = dates[i - 200]
        e_date = dates[i - 1]
        Y = df_y[(df_y['date'] >= s_date) & (df_y['date'] <= e_date) & (df_y['ids'] == id)]['Y']
        Y = Y.reset_index()['Y']
        X = df_x[(df_x['date'] >= s_date) & (df_x['date'] <= e_date)]
        X = X.reset_index()[['X1','X2','X3','X4','X5','X6']]
        X = sm.add_constant(X)
        if len(X) != len(Y):
            continue
        regr = sm.OLS(Y, X).fit() #Hangs here after 2 years.
        X_pr = X.tail(1)
        Y_hat = regr.predict(X_pr)
        # write the prediction back to df_y (Y is only the re-indexed slice)
        df_y.loc[(df_y['date'] == e_date) & (df_y['ids'] == id), 'Y_hat'] = Y_hat.tolist()[0]
My attempt above seems to be working fine up until the point where it hangs (most likely at fitting step) after running for approx. 2 years. I am inclined to use statsmodels since it supports regularization (planning for future work). However, if using other library makes it faster or more elegant then I am fine with it too. Could someone please help define the fastest solution that doesn't hang midway. Thanks a lot.

I was able to get this workaround using Pandas MovingOLS
import pandas as pd
# note: pd.stats.ols.MovingOLS was removed in pandas 0.20, so this needs an older pandas
dates = list(df_y['date'].unique())
ids = list(df_y['ids'].unique())
Y_hats = []
for id in ids:
    Y = df_y[(df_y['ids'] == id)][['date', 'ids', 'Y']]
    Y = Y.merge(df_x, how='left', on=['date'])
    X_cols = [c for c in df_x.columns if c != 'date']
    model = pd.stats.ols.MovingOLS(y=Y['Y'], x=Y[X_cols], window_type='rolling', window=250, intercept=True)
    Y['intercept'] = 1
    betas = model.beta
    betas = betas.multiply(Y[betas.columns], axis='index')
    betas = betas.sum(axis=1)
    betas = betas[betas > 0]
    betas = betas.to_frame()
    betas.columns = ['Y_hat']
    betas = betas.merge(Y[['date', 'ids']], how='left', left_index=True, right_index=True)
    Y_hats.append(betas)
Y_hats = pd.concat(Y_hats)
Y = Y.merge(Y_hats[['date', 'ids', 'Y_hat']], how='left', on=['date', 'ids'])
There is a straightforward way to use Y['Y_hat'] = model.y_predict if, let's say, one wants to fit Y ~ X on (y_1, y_2, ... y_n) and (x_1, x_2, ... x_n) but only wants to predict Y_(n+1) using X_(n+1).
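On a current stack, the same rolling fit-and-predict can be sketched without MovingOLS. The version below is an assumption-laden illustration in plain NumPy (it needs NumPy >= 1.20 for sliding_window_view), not a statsmodels replacement: it solves the normal equations for all windows in one batched call, has no regularization, and assumes every window has full column rank. The function name rolling_ols_predict and the synthetic data are made up.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_ols_predict(X, y, window):
    """Fit OLS with intercept on every length-`window` slice and predict the
    slice's last observation. Returns (betas, y_hat), one row per window."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    # Windowed views, no data copied: (m, window, p) and (m, window)
    Xw = sliding_window_view(X, window, axis=0).transpose(0, 2, 1)
    yw = sliding_window_view(y, window)
    XtX = Xw.transpose(0, 2, 1) @ Xw             # (m, p, p)
    Xty = Xw.transpose(0, 2, 1) @ yw[..., None]  # (m, p, 1)
    betas = np.linalg.solve(XtX, Xty)[..., 0]    # (m, p), batched solve
    # Row-wise dot product of each window's betas with its last design row.
    y_hat = np.einsum('ij,ij->i', betas, X[window - 1:])
    return betas, y_hat

# Synthetic, noise-free demo: y = 0.5 + 1*x1 - 2*x2
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 0.5 + X_demo @ np.array([1.0, -2.0])
betas_demo, y_hat_demo = rolling_ols_predict(X_demo, y_demo, window=20)
```

For the question's data, this would be called once per id on that id's Y merged with df_x on date, with window=200; ids with fewer than 200 rows would need to be skipped first.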
