I'm trying to collect/concat multiple numpy arrays in a single numpy array. I can do this with pandas data frame as:
df_train = pd.DataFrame()
... loop ...:
df_temp = pd.read_json(file)
df_train = pd.concat([df_train, df_temp], ignore_index=True, axis=0, sort=False)
in a loop. With this I'm able to combine various data in a single data frame.
What I want to do this is with numpy arrays. I tried the same thing as:
nump_train = np.nan
... loop ...:
nump = df_temp.values
nump_train = np.concatenate((nump_train, nump))
but I cannot concat zero-dimensional arrays as the error message says (ValueError: zero-dimensional arrays cannot be concatenated)
How can I do this like in pandas?
ps: I can solve this with a bad-written hard-coded code as:
w=1
for loop:
if w == 1:
nump1 = sc.transform(df_temp.drop(['time'], axis=1))
elif w == 2:
nump2 = sc.transform(df_temp.drop(['time','trend'], axis=1))
elif w == 3:
nump3 = sc.transform(df_temp.drop(['time'], axis=1))
w += 1
X_train = np.concatenate((nump1, nump2, nump3), axis = 0)
Bu this bad coding and I cannot scale this in a loop.
EDIT 1:
Actual code is this:
w = 1
for i in range(1, loop_size+1):
df_train = pd.DataFrame()
nump_train = np.nan
random_list = random.sample(file_list, selection)
for json in random_list:
json_name = json[:json.index('_')]
df_temp = pd.read_json(filedir + json)
train_period_mask = (df_temp['time'] > train_start_date) & (df_temp['time'] < train_end_date)
df_temp = df_temp.loc[train_period_mask]
df_temp.index = np.arange(0, len(df_temp))
df_temp = calc_(df_temp)
df_temp['trend'] = zg(df_temp, zg_ratio)
df_temp['trend_shifted'] = df_temp.trend.shift(-1)
df_temp = df_temp.dropna()
nump = sc.fit_transform(df_temp.drop(['time','trend_shifted','trend'], axis=1))
if w == 1:
nump1 = sc.transform(df_temp.drop(['time','trend_shifted','trend'], axis=1))
elif w == 2:
nump2 = sc.transform(df_temp.drop(['time','trend_shifted','trend'], axis=1))
elif w == 3:
nump3 = sc.transform(df_temp.drop(['time_period_start','trend_shifted','trend'], axis=1))
df_train = pd.concat([df_train, df_temp], ignore_index=True, axis=0, sort=False)
nump_train.append(nump)
w += 1
drop_list = ['time_period_start']
df_train.drop(drop_list, 1, inplace = True )
start = timeit.default_timer()
sc = MinMaxScaler()
X_train = sc.fit_transform(df_train.drop(['trend','trend_shifted'], axis=1))
X_train2 = np.concatenate((nump1, nump2, nump3), axis = 0)
y_train = df_train['trend_shifted'].values
I want X_train and X_train2 to have the same shape.
Related
I am trying to get RF feature importance, I fit the random forest on the data like this:
model = RandomForestRegressor()
n = model.fit(self.X_train,self.y_train)
if n is not None:
df = pd.DataFrame(data = n , columns = ["Feature","Importance_Score"])
df["Feature_Name"] = np.array(self.X_Headers)
df = df.drop(["Feature"], axis = 1)
df[["Feature_Name","Importance_Score"]].to_csv("RF_Importances.csv", index = False)
del df
However, the n variable returns None, why is this happening?
Not very sure how model.fit(self.X_train,self.y_train) is supposed to work. Need more information about how you set up the model.
If we set this up using simulated data, it works:
np.random.seed(111)
X = pd.DataFrame(np.random.normal(0,1,(100,5)),columns=['A','B','C','D','E'])
y = np.random.normal(0,1,100)
model = RandomForestRegressor()
n = model.fit(X,y)
if n is not None:
df = pd.DataFrame({'features':X.columns,'importance':n.feature_importances_})
df
features importance
0 A 0.176091
1 B 0.183817
2 C 0.169927
3 D 0.267574
4 E 0.202591
I am trying to run a fama-macbeth regression in a python. As afirst step I am running the time series for every asset in my portfolio but I am unable to run it because I am getting an error:
'ValueError: Must pass DataFrame with boolean values only'
I am relatively new to python and have heavily relied on this forum to help me out. I hope it you can help me with this issue.
Please let me know how I can resolve this. I will be very grateful to you!
I assume this line is producing the error. Cause when I run the function without the for loop, it works perfectly.
for i in range(cols):
df_beta = RegressionRoll(df=data_set, subset = 0, dependent = data_set.iloc[:,i], independent = data_set.iloc[:,30:], const = True, parameters = 'beta',
win = 12)
The dimension of my matrix is 108x35, 30 stocks and 5 factors over 108 points. Hence I want to run a regression for every stock against the 4 factors and store the result of the coeffs in a dataframe. Sample dataframe:
Date BAS GY AI FP SGL GY LNA GY AKZA NA Market Factor
1/29/2010 -5.28% -7.55% -1.23% -5.82% -7.09% -5.82%
2/26/2010 0.04% 13.04% -1.84% 4.06% -14.62% -14.62%
3/31/2010 10.75% 1.32% 7.33% 6.61% 12.21% 12.21%
The following is the entire code:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
data_set = pd.read_excel(r'C:\XXX\Research Project\Data\Regression.xlsx', sheet_name = 'Fama Macbeth')
data_set.set_index(data_set['Date'], inplace=True)
data_set.drop('Date', axis=1, inplace=True)
X = data_set.iloc[:,30:]
y = data_set.iloc[:,:30]
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
# Data subset
if subset != 0:
df = df.tail(subset)
else:
df = df
# Loopinfo
end = df.shape[0]
win = win
rng = np.arange(start = win, stop = end, step = 1)
# Subset and store dataframes
frames = {}
n = 1
for i in rng:
df_temp = df.iloc[:i].tail(win)
newname = 'df' + str(n)
frames.update({newname: df_temp})
n += 1
# Analysis on subsets
df_results = pd.DataFrame()
for frame in frames:
#print(frames[frame])
# Rolling data frames
dfr = frames[frame]
y = dependent
x = independent
if const == True:
x = sm.add_constant(dfr[x])
model = sm.OLS(dfr[y], x).fit()
else:
model = sm.OLS(dfr[y], dfr[x]).fit()
if parameters == 'beta':
theParams = model.params[0:]
coefs = theParams.to_frame()
df_temp = pd.DataFrame(coefs.T)
indx = dfr.tail(1).index[-1]
df_temp['Date'] = indx
df_temp = df_temp.set_index(['Date'])
df_results = pd.concat([df_results, df_temp], axis = 0)
if parameters == 'R2':
theParams = model.rsquared
df_temp = pd.DataFrame([theParams])
indx = dfr.tail(1).index[-1]
df_temp['Date'] = indx
df_temp = df_temp.set_index(['Date'])
df_temp.columns = [', '.join(independent)]
df_results = pd.concat([df_results, df_temp], axis = 0)
return(df_results)
cols = len(y.columns)
for i in range(cols):
df_beta = RegressionRoll(df=data_set, subset = 0, dependent = data_set.iloc[:,i], independent = data_set.iloc[:,30:], const = True, parameters = 'beta',
win = 12)
ValueError: Must pass DataFrame with boolean values only
For some dataset group_1 I need to iterate over all rows k times for robustness and find a matching random sample of another data frame group_2 according to some criteria expressed as data frame columns.
Unfortunately, this is fairly slow.
How can I improve performance?
The bottleneck is the apply-ed function, i.e. randomMatchingCondition.
import tqdm
import numpy as np
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
seed = 47
np.random.seed(seed)
###################################################################
# generate dummy data
size = 10000
df = pd.DataFrame({i: np.random.randint(1,100,size=size) for i in ['metric']})
df['label'] = np.random.randint(0,2, size=size)
df['group_1'] = pd.Series(np.random.randint(1,12, size=size)).astype(object)
df['group_2'] = pd.Series(np.random.randint(1,10, size=size)).astype(object)
group_0 = df[df['label'] == 0]
group_0 = group_0.reset_index(drop=True)
group_0 = group_0.rename(index=str, columns={"metric": "metric_group_0"})
join_columns_enrich = ['group_1', 'group_2']
join_real = ['metric_group_0']
join_real.extend(join_columns_enrich)
group_0 = group_0[join_real]
display(group_0.head())
group_1 = df[df['label'] == 1]
group_1 = group_1.reset_index(drop=True)
display(group_1.head())
###################################################################
# naive find random element matching condition
def randomMatchingCondition(original_element, group_0, join_columns, random_state):
limits_dict = original_element[join_columns_enrich].to_dict()
query = ' & '.join([f"{k} == {v}" for k, v in limits_dict.items()])
candidates = group_0.query(query)
if len(candidates) > 0:
return candidates.sample(n=1, random_state=random_state)['metric_group_0'].values[0]
else:
return np.nan
###################################################################
# iterate over pandas dataframe k times for more robust sampling
k = 3
resulting_df = None
for i in range(1, k+1):
group_1['metric_group_0'] = group_1.progress_apply(randomMatchingCondition,
args=[group_0, join_columns_enrich, None],
axis = 1)
group_1['run'] = i
if resulting_df is None:
resulting_df = group_1.copy()
else:
resulting_df = pd.concat([resulting_df, group_1])
resulting_df.head()
Experimenting with pre-sorting the data:
group_0 = group_0.sort_values(join_columns_enrich)
group_1 = group_1.sort_values(join_columns_enrich)
does not show any difference.
IIUC you want to end up with k number of random samples for each row (combination of metrics) in your input dataframe. So why not candidates.sample(n=k, ...), and get rid of the for loop? Alternatively you could concatenate you dataframe k times with pd.concat([group1] * k).
It depends on your real data but I would give a shot for grouping the input dataframe by metric columns with group1.groupby(join_columns_enrich) (if their cardinality is sufficiently low), and apply the random sampling on these groups, picking k * len(group.index) random samples for each. groupby is expensive, OTOH you might save a lot on the iteration/sampling once it's done.
#smiandras, you are correct. Getting rid of the for loop is important.
Variant 1: multiple samples:
def randomMatchingCondition(original_element, group_0, join_columns, k, random_state):
limits_dict = original_element[join_columns_enrich].to_dict()
query = ' & '.join([f"{k} == {v}" for k, v in limits_dict.items()])
candidates = group_0.query(query)
if len(candidates) > 0:
return candidates.sample(n=k, random_state=random_state, replace=True)['metric_group_0'].values
else:
return np.nan
###################################################################
# iterate over pandas dataframe k times for more robust sampling
k = 3
resulting_df = None
#######################
# trying to improve performance: sort both dataframes
group_0 = group_0.sort_values(join_columns_enrich)
group_1 = group_1.sort_values(join_columns_enrich)
#######################
group_1['metric_group_0'] = group_1.progress_apply(randomMatchingCondition,
args=[group_0, join_columns_enrich, k, None],
axis = 1)
print(group_1.isnull().sum())
group_1 = group_1[~group_1.metric_group_0.isnull()]
display(group_1.head())
s=pd.DataFrame({'metric_group_0':np.concatenate(group_1.metric_group_0.values)},index=group_1.index.repeat(group_1.metric_group_0.str.len()))
s = s.join(group_1.drop('metric_group_0',1),how='left')
s['pos_in_array'] = s.groupby(s.index).cumcount()
s.head()
Variant 2: all possible samples optimized by native JOIN operation.
WARN this is a bit unsafe as it might generate a gigantic number of rows:
size = 1000
df = pd.DataFrame({i: np.random.randint(1,100,size=size) for i in ['metric']})
df['label'] = np.random.randint(0,2, size=size)
df['group_1'] = pd.Series(np.random.randint(1,12, size=size)).astype(object)
df['group_2'] = pd.Series(np.random.randint(1,10, size=size)).astype(object)
group_0 = df[df['label'] == 0]
group_0 = group_0.reset_index(drop=True)
join_columns_enrich = ['group_1', 'group_2']
join_real = ['metric']
join_real.extend(join_columns_enrich)
group_0 = group_0[join_real]
display(group_0.head())
group_1 = df[df['label'] == 1]
group_1 = group_1.reset_index(drop=True)
display(group_1.head())
df = group_1.merge(group_0, on=join_columns_enrich)
display(df.head())
print(group_1.shape)
df.shape
I'm trying to fill pandas dataframe columns in a for loop. The column name is parametric and assigned by loop value. This is my code:
for k in range (-1, -4, -1):
df_orj = pd.read_csv('something.csv', sep= '\t')
df_train = df_orj.head(11900)
df_test = df_orj.tail(720)
SHIFT = k
df_train.trend = df_train.trend.shift(SHIFT)
df_train = df_train.dropna()
df_test.trend = df_test.trend.shift(SHIFT)
df_test = df_test.dropna()
drop_list = some_list
df_out = df_test[['date','price']]
df_out.index = np.arange(0, len(df_out)) # start index from 0
df_out["pred-1"] = np.nan
df_out["pred-2"] = np.nan
df_out["pred-3"] = np.nan
df_train.drop(drop_list, 1, inplace = True )
df_test.drop(drop_list, 1, inplace = True )
# some processes here
rf = RandomForestClassifier(n_estimators = 10)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
print("accuracy score: " , rf.score(X_test, y_test))
X_test2 = sc.transform(df_test.drop('trend', axis=1))
y_test2 = df_test['trend'].values
y_pred2 = rf.predict(X_test2)
print("accuracy score: ",rf.score(X_test2, y_test2))
name = "pred{0}".format(k)
for i in range (0, y_test2.size):
df_out[name][i] = y_pred2[i]
df_out.head(20)
And this is my output:
time_period_start price_open pred-1 pred-2 pred-3
697 2018-10-02T02:00:00.0000000Z 86.80 NaN NaN 1.0
698 2018-10-02T03:00:00.0000000Z 86.65 NaN NaN 1.0
699 2018-10-02T04:00:00.0000000Z 86.32 NaN NaN 1.0
As you can see, only pred-3 is filled. How can I fill all 3 pre-defined columns?
If i am understanding correctly, then your issue is that you are getting pred-3
filled only where as other two are nan.
It's because your df_out is in the loop and you are getting the results for last
iteration of loop.
You should define it outside the loop so that you information won't get lost for
the other two.
Your setting those 3 columns as nulls in each loop, so you’re losing those values as it iterates. Either move those initializing columns to before the loop, or you could just initialize with variables with:
Change out
df_out["pred-1"] = np.nan
df_out["pred-2"] = np.nan
df_out["pred-3"] = np.nan
To just initialize the individual column as it loops
name = "pred{0}".format(k)
df_out[name] = np.nan
So full code:
for k in range (-1, -4, -1):
df_orj = pd.read_csv('something.csv', sep= '\t')
df_train = df_orj.head(11900)
df_test = df_orj.tail(720)
SHIFT = k
df_train.trend = df_train.trend.shift(SHIFT)
df_train = df_train.dropna()
df_test.trend = df_test.trend.shift(SHIFT)
df_test = df_test.dropna()
drop_list = some_list
df_out = df_test[['date','price']]
df_out.index = np.arange(0, len(df_out)) # start index from 0
name = "pred{0}".format(k)
df_out[name] = np.nan
df_train.drop(drop_list, 1, inplace = True )
df_test.drop(drop_list, 1, inplace = True )
# some processes here
rf = RandomForestClassifier(n_estimators = 10)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
print("accuracy score: " , rf.score(X_test, y_test))
X_test2 = sc.transform(df_test.drop('trend', axis=1))
y_test2 = df_test['trend'].values
y_pred2 = rf.predict(X_test2)
print("accuracy score: ",rf.score(X_test2, y_test2))
for i in range (0, y_test2.size):
df_out[name][i] = y_pred2[i]
df_out.head(20)
I have two huge dataframes df_y and df_x.
df_y has columns ['date','ids','Y']. Basically each 'ids' has data for all the 'date'.
df_x has columns ['date','X1','X2','X3','X4','X5','X6'].
df_x has all the date that are in df_y. However some ids might have shorter period, i.e., either starting from a late date or ending
at an early date.
I want to run a rolling linear regression (OLS) Id ~ X1 + X2 + X3 + X4 + X5 + X6 + intercept for each 'ids' in df_y with a lookback of 200 days.
Sample dataframes:
import string, random, pandas as pd, numpy as np
ids = [''.join(random.choice(string.ascii_uppercase) for _ in range(3)) for _ in range(200)]
dates = pd.date_range('2000-01-01', '2017-07-02')
df_dates = pd.DataFrame({'date':dates, 'joinC':len(dates)*[2]})
df_ids = pd.DataFrame({'ids':ids, 'joinC':len(ids)*[2]})
df_values = pd.DataFrame({'Y':np.random.normal(size =
len(dates)*len(ids))})
df_y = df_dates.merge(df_ids, on='joinC', how="outer")
df_y = df_y[['date', 'ids']].merge(df_values, left_index=True,
right_index=True, how="inner")
df_y = df_y.sort_values(['date', 'ids'], ascending=[True, True])
df_x = pd.DataFrame({'date':dates, 'X1':np.random.normal(size = len(dates)), 'X2':np.random.normal(size = len(dates)), 'X3':np.random.normal(size = len(dates)), 'X4':np.random.normal(size = len(dates)), 'X5':np.random.normal(size = len(dates)), 'X6':np.random.normal(size = len(dates))})
My attempt:
import statsmodels.api as sm
dates = list(df_y['date'].unique())
ids = list(df_y['ids'].unique())
for i in range(200, len(dates) +1):
for id in ids:
s_date = dates[i - 200]
e_date = dates[i - 1]
Y = df_y[(df_y['date'] >= s_date) & (df_y['date'] <= e_date) & (df_y['ids'] == id)]['Y']
Y = Y.reset_index()['Y']
X = df_x[(df_x['date'] >= s_date) & (df_x['date'] <= e_date)]
X = X.reset_index()[['X1','X2','X3','X4','X5','X6']]
X = sm.add_constant(X)
if len(X) <> len(Y):
continue
regr = sm.OLS(Y, X).fit() #Hangs here after 2 years.
X_pr = X.tail(1)
Y_hat = regr.predict(X_pr)
Y.loc[(df_y['date'] == e_date) & (df_y['ids'] == id), 'Y_hat'] = Y_hat.tolist()[0]
My attempt above seems to be working fine up until the point where it hangs (most likely at fitting step) after running for approx. 2 years. I am inclined to use statsmodels since it supports regularization (planning for future work). However, if using other library makes it faster or more elegant then I am fine with it too. Could someone please help define the fastest solution that doesn't hang midway. Thanks a lot.
I was able to get this workaround using Pandas MovingOLS
import pandas as pd
dates = list(df_y['date'].unique())
ids = list(df_y['ids'].unique())
Y_hats = []
for id in ids:
Y = df_y[(df_y['ids'] == id)][['date', 'ids', 'Y']]
Y = Y.merge(df_x, how='left', on=['date'])
X_cols = list(df_x.columns).remove['date']
model = pd.stats.ols.MovingOLS(y=Y['Y'], x=Y[X_cols], window_type='rolling', window=250, intercept=True)
Y['intercept'] = 1
betas = model.beta
betas = betas.multiply(Y[betas.columns], axis='index')
betas = betas.sum(axis=1)
betas = betas[betas > 0]
betas = betas.to_frame()
betas.columns = [['Y_hat']]
betas = betas.merge(Y[['date', 'ids']], how='left', left_index=True, right_index=True)
Y_hats.append(betas)
Y_hats = pd.concat(Y_hats)
Y = Y.merge(Y_hats[['date', 'ids', 'Y_hat'], how='left', on=['date', 'ids']]
There is a straightforward way to use Y['Y_hat'] = model.y_predict if lets say one wants to fit Y ~ X on (y_1, y_2, ... y_n) and (x_1, x_2, ... x_n) but only wants to predict Y_(n+1) using X_(n+1).