How is one intended to use the output of the pandas.ewm.cov function. I would presume that there are functions that allow you to directly use it in the form returned for multiplication, but nothing I try seems to work.
For example, suppose I take a minimal use case, stock X and Y returns timeseries in DF1, so we estimate an ewma covariance matrix, then to get the variance estimate for a portfolio of position A and B (given in DF2) I need to compute $x^T C x$, but I can't find the command to do this without writing a for loop?
# Python 3.6, pandas 0.20
import pandas as pd
import numpy as np
np.random.seed(100)
DF1 = pd.DataFrame(dict(X = np.random.normal(size = 100), Y = np.random.normal(size = 100)))
DF2 = pd.DataFrame(dict(A = np.random.normal(size = 100), B = np.random.normal(size = 100)))
COV = DF1.ewm(10).cov()
print(DF1)
print(COV)
# All of the following are invalid
print(COV.dot(DF2))
print(DF2.dot(COV))
print(COV.multiply(DF2))
The best I can figure out is this ugly piece of code
COV.reset_index().rename(columns = dict(level_0 = "index", level_1 = "variable"), inplace = True)
DF2m = pd.melt(DF2.reset_index(), id_vars = "index").sort_values("index")
MDF = pd.merge(COV, DF2m, on=["index", "variable"])
VAR = MDF.groupby("index").apply(lambda x: np.dot(np.dot(x["value"], np.matrix([x["X"], x["Y"]])), x["value"])[0,0])
I hold out hope that there is a nice way to do this...
Related
I've got a weird question for a class project. Assuming X ~ Exp(Lambda), Lambda=1.6, I have to generate 100 samples of X, with the indices corresponding to the sample size of each generated sample (S1, S2 ... S100). I've worked out a simple loop which generate the required samples in array, but i am not able to rename the array.
First attempt:
import numpy as np
import matplotlib.pyplot as plt
samples = []
for i in range(1,101,1):
samples.append(np.random.exponential(scale= 1/1.6, size= i))
Second attempt:
import numpy as np
import matplotlib.pyplot as plt
for i in range(1,101,1):
samples = np.random.exponential(scale= 1/1.2, size= i)
col = f'samples {i}'
df_samples[col] = exponential_sample
df_samples = pd.DataFrame(samples)
An example how I would like to visualize the data:
# drawing 50 random samples of size 2 from the exponentially distributed population
sample_size = 2
df2 = pd.DataFrame(index= ['x1', 'x2'] )
for i in range(1, 51):
exponential_sample = np.random.exponential((1/rate), sample_size)
col = f'sample {i}'
df2[col] = exponential_sample
# Taking a peek at the samples
df2
But instead of having a simple size = 2, I would like to have sample size = i. This way, I will be able to generate 1 rows for the first column (S1), 2 rows for the second column (S2), until I reach 100 rows for the 100th column (S100).
You cannot stick vectors of different lengths easily into a df so your mock-up code would not work, but you can concat one vector at a time:
df = pd.DataFrame()
for i in range(100,10100,100):
tmp = pd.DataFrame({f'S{i}':np.random.exponential(scale= 1/1.2, size= i)})
df = pd.concat([df, tmp], axis=1)
Use a dict instead maybe?
samples = {}
for i in range(100,10100,100):
samples[i] = np.random.exponential(scale= 1/1.2, size= i)
Then you can convert it into a pandas Dataframe if you like.
I am trying to use plotnine in python and couldn't do fct_reorder of r in python. Basically I would like to plot categories from the categorical variable to arrange on x axis based on the increasing value from another variable but I am unable to do so.
import numpy as np
import pandas as pd
from plotnine import *
test_df = pd.DataFrame({'catg': ['a','b','c','d','e'],
'val': [3,1,7,2,5]})
test_df['catg'] = test_df['catg'].astype('category')
When I sort & plot this based on .sort_values() then it doesn't rearrange the categories on x axis:
test_df = test_df.sort_values(by = ['val']).reset_index(drop=True)
(ggplot(data = test_df,
mapping = aes(x = test_df.iloc[:, 0], y = test_df['val']))
+ geom_line(linetype = 2)
+ geom_point()
+ labs(title = str('Weight of Evidence by ' + test_df.columns[0]),
x = test_df.columns[0],
y = 'Weight of Evidence')
+ theme(axis_text_x= element_text(angle = 0))
)
Desired output:
I saw this SO Post where they are using reorder but I couldn't find any reorder in plotnine to work.
Plotnine does have reorder. It is an internal function available when creating and aesthetic mapping, just like factor.
In your example you could use it like this:
ggplot(data=test_df, mapping=aes(x='reorder(catg, val)', y='val'))
I was just looking at https://en.wikipedia.org/wiki/Chi-squared_test and wanted to recreate the example "Example chi-squared test for categorical data".
I feel that the approach I've taken might have room for improvement, so was wondering how that might be done.
Here's the code:
csv = """\
,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
observed_workers = pd.read_csv(io.StringIO(csv), index_col=0)
col_sums = dt.apply(sum)
row_sums = dt.apply(sum, axis=1)
l = list(x[1] * (x[0] / col_sums.sum()) for x in itertools.product(row_sums, col_sums))
expected_workers = pd.DataFrame(
np.array(l).reshape((3, 4)),
columns=observed_workers.columns,
index=observed_workers.index,
)
chi_squared_stat = (
((observed_workers - expected_workers) ** 2).div(expected_workers).sum().sum()
)
This returns the correct value, but is probably ignorant of a nicer approach using some particular numpy / pandas methods.
With numpy/scipy:
csv = """\
,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
import io
from numpy import genfromtxt, outer
from scipy.stats.contingency import margins
observed = genfromtxt(io.StringIO(csv), delimiter=',', skip_header=True, usecols=range(1, 5))
row_sums, col_sums = margins(observed)
expected = outer(row_sums, col_sums) / observed.sum()
chi_squared_stat = ((observed - expected)**2 / expected).sum()
print(chi_squared_stat)
With pandas:
import io
import pandas as pd
csv = """\
work_group,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
df = pd.read_csv(io.StringIO(csv))
df_melt = df.melt(id_vars ='work_group', var_name='group', value_name='observed')
df_melt['col_sum'] = df_melt.groupby('group')['observed'].transform(np.sum)
df_melt['row_sum'] = df_melt.groupby('work_group')['observed'].transform(np.sum)
total = df_melt['observed'].sum()
df_melt['expected'] = df_melt.apply(lambda row: row['col_sum']*row['row_sum']/total, axis=1)
chi_squared_stat = df_melt.apply(lambda row: ((row['observed'] - row['expected'])**2) / row['expected'], axis=1).sum()
print(chi_squared_stat)
I am trying to run a fama-macbeth regression in a python. As afirst step I am running the time series for every asset in my portfolio but I am unable to run it because I am getting an error:
'ValueError: Must pass DataFrame with boolean values only'
I am relatively new to python and have heavily relied on this forum to help me out. I hope it you can help me with this issue.
Please let me know how I can resolve this. I will be very grateful to you!
I assume this line is producing the error. Cause when I run the function without the for loop, it works perfectly.
for i in range(cols):
df_beta = RegressionRoll(df=data_set, subset = 0, dependent = data_set.iloc[:,i], independent = data_set.iloc[:,30:], const = True, parameters = 'beta',
win = 12)
The dimension of my matrix is 108x35, 30 stocks and 5 factors over 108 points. Hence I want to run a regression for every stock against the 4 factors and store the result of the coeffs in a dataframe. Sample dataframe:
Date BAS GY AI FP SGL GY LNA GY AKZA NA Market Factor
1/29/2010 -5.28% -7.55% -1.23% -5.82% -7.09% -5.82%
2/26/2010 0.04% 13.04% -1.84% 4.06% -14.62% -14.62%
3/31/2010 10.75% 1.32% 7.33% 6.61% 12.21% 12.21%
The following is the entire code:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
data_set = pd.read_excel(r'C:\XXX\Research Project\Data\Regression.xlsx', sheet_name = 'Fama Macbeth')
data_set.set_index(data_set['Date'], inplace=True)
data_set.drop('Date', axis=1, inplace=True)
X = data_set.iloc[:,30:]
y = data_set.iloc[:,:30]
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
# Data subset
if subset != 0:
df = df.tail(subset)
else:
df = df
# Loopinfo
end = df.shape[0]
win = win
rng = np.arange(start = win, stop = end, step = 1)
# Subset and store dataframes
frames = {}
n = 1
for i in rng:
df_temp = df.iloc[:i].tail(win)
newname = 'df' + str(n)
frames.update({newname: df_temp})
n += 1
# Analysis on subsets
df_results = pd.DataFrame()
for frame in frames:
#print(frames[frame])
# Rolling data frames
dfr = frames[frame]
y = dependent
x = independent
if const == True:
x = sm.add_constant(dfr[x])
model = sm.OLS(dfr[y], x).fit()
else:
model = sm.OLS(dfr[y], dfr[x]).fit()
if parameters == 'beta':
theParams = model.params[0:]
coefs = theParams.to_frame()
df_temp = pd.DataFrame(coefs.T)
indx = dfr.tail(1).index[-1]
df_temp['Date'] = indx
df_temp = df_temp.set_index(['Date'])
df_results = pd.concat([df_results, df_temp], axis = 0)
if parameters == 'R2':
theParams = model.rsquared
df_temp = pd.DataFrame([theParams])
indx = dfr.tail(1).index[-1]
df_temp['Date'] = indx
df_temp = df_temp.set_index(['Date'])
df_temp.columns = [', '.join(independent)]
df_results = pd.concat([df_results, df_temp], axis = 0)
return(df_results)
cols = len(y.columns)
for i in range(cols):
df_beta = RegressionRoll(df=data_set, subset = 0, dependent = data_set.iloc[:,i], independent = data_set.iloc[:,30:], const = True, parameters = 'beta',
win = 12)
ValueError: Must pass DataFrame with boolean values only
I'm running a loop that appends values to an empty dataframe out side of the loop. However, when this is done, the datframe remains empty. I'm not sure what's going on. The goal is to find the power value that results in the lowest sum of squared residuals.
Example code below:
import tweedie
power_list = np.arange(1.3, 2, .01)
mean = 353.77
std = 17298.24
size = 860310
x = tweedie.tweedie(mu = mean, p = 1.5, phi = 50).rvs(len(x))
variance = 299228898.89
sum_ssr_df = pd.DataFrame(columns = ['power', 'dispersion', 'ssr'])
for i in power_list:
power = i
phi = variance/(mean**power)
tvs = tweedie.tweedie(mu = mean, p = power, phi = phi).rvs(len(x))
sort_tvs = np.sort(tvs)
df = pd.DataFrame([x, sort_tvs]).transpose()
df.columns = ['actual', 'random']
df['residual'] = df['actual'] - df['random']
ssr = df['residual']**2
sum_ssr = np.sum(ssr)
df_i = pd.DataFrame([i, phi, sum_ssr])
df_i = df_i.transpose()
df_i.columns = ['power', 'dispersion', 'ssr']
sum_ssr_df.append(df_i)
sum_ssr_df[sum_ssr_df['ssr'] == sum_ssr_df['ssr'].min()]
What exactly am I doing incorrectly?
This code isn't as efficient as is could be as noted by ALollz. When you append, it basically creates a new dataframe in memory (I'm oversimplifying here).
The error in your code is:
sum_ssr_df.append(df_i)
should be:
sum_ssr_df = sum_ssr_df.append(df_i)