Why is `summary_col` ignoring the `info_dict` parameter? - python

I need to run some linear regressions and output LaTeX code with statsmodels in Python. I am using the summary_col function to achieve that. However, there is either a bug or a misunderstanding on my side. Please see the following code:
import numpy as np
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
np.random.seed(123)
nsample = 100
x = np.linspace(0, 10, 100)
X = np.column_stack((x, x ** 2))
beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)
X = sm.add_constant(X)
y1 = np.dot(X, beta) + e
y2 = np.dot(X, beta) + 2 * e
model1 = sm.OLS(y1, X).fit()
model2 = sm.OLS(y2, X).fit()
Now, to have a table with the two models side by side:
out_table = summary_col(
    [model1, model2],
    stars=True,
    float_format='%.2f',
    info_dict={
        'N': lambda x: "{0:d}".format(int(x.nobs)),
        'R2': lambda x: "{:.2f}".format(x.rsquared)
    }
)
Hence I'd expect a table reporting only the number of observations and the $R^2$, since I am explicit about the info_dict argument. The result I get, however, is the following:
==============================
                 y I     y II 
------------------------------
const          0.81**   0.63  
               (0.34)   (0.67)
x1             0.22     0.35  
               (0.16)   (0.31)
x2             9.99***  9.98***
               (0.02)   (0.03)
R-squared      1.00     1.00  
R-squared Adj. 1.00     1.00  
N              100      100   
R2             1.00     1.00  
==============================
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01
Please notice the two extra rows with the plain R-squared and the adjusted one. My desired output would be:
==============================
                 y I     y II 
------------------------------
const          0.81**   0.63  
               (0.34)   (0.67)
x1             0.22     0.35  
               (0.16)   (0.31)
x2             9.99***  9.98***
               (0.02)   (0.03)
N              100      100   
R2             1.00     1.00  
==============================
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01
The documentation is not very detailed yet: https://tedboy.github.io/statsmodels_doc/generated/statsmodels.iolib.summary2.summary_col.html
Any ideas on how to display only the information requested by the info_dict argument?

Let's have a look at the source code at
https://github.com/statsmodels/statsmodels/blob/main/statsmodels/iolib/summary2.py
We can see that the function summary_col takes info_dict as an argument and uses it in the following way:
if info_dict:
    cols = [_col_info(x, info_dict.get(x.model.__class__.__name__, info_dict))
            for x in results]
In this case, it means that _col_info(model1, info_dict) and _col_info(model2, info_dict) are called in order to generate your N and R2 rows. The absence of type annotations and comments makes these functions quite obscure, actually.
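As an aside (my own reading of the .get() call above, not documented behaviour): because the lookup key is the model class name, you could also key info_dict by class, e.g. 'OLS', to use different info rows per model type. A minimal sketch:
# Illustration only: per-class info rows, keyed by the model class name ('OLS' here).
out_table_by_class = summary_col(
    [model1, model2],
    stars=True,
    float_format='%.2f',
    info_dict={'OLS': {
        'N': lambda x: "{0:d}".format(int(x.nobs)),
        'R2': lambda x: "{:.2f}".format(x.rsquared)
    }}
)
This does not remove the hard-coded R-squared rows either; that is handled below.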
Later on, the cols list will be added to the variable summ that will be part of a Summary object.
smry = Summary()
smry._merge_latex = True
smry.add_df(summ, header=True, align='l')
However, cols is actually a redefinition; it was defined before as
cols = [_col_params(x, stars=stars, float_format=float_format) for x in results]
and that constituted the first part of summ.
The issue is that _col_params will add R-squared and R-squared Adj. whether you like it or not; here is the relevant part of its source code:
rsquared = getattr(result, 'rsquared', np.nan)
rsquared_adj = getattr(result, 'rsquared_adj', np.nan)
r2 = pd.Series({('R-squared', ""): rsquared,
                ('R-squared Adj.', ""): rsquared_adj})
if r2.notnull().any():
    r2 = r2.apply(lambda x: float_format % x)
    res = pd.concat([res, r2], axis=0)
res = pd.DataFrame(res)
res.columns = [str(result.model.endog_names)]
return res
So what I would suggest is to manually modify the tables attribute of your output, keeping the six coefficient rows and the last two rows (N and R2) while skipping rows 6 and 7 (R-squared and R-squared Adj.):
rm_extra_rows = lambda t: t.iloc[list(range(6)) + [8, 9], :]
out_table.tables = [rm_extra_rows(el) for el in out_table.tables]
After that I get
In [53]: out_table
Out[53]:
<class 'statsmodels.iolib.summary2.Summary'>
"""
=====================
        y I     y II 
---------------------
const 0.81**  0.63   
      (0.34)  (0.67) 
x1    0.22    0.35   
      (0.16)  (0.31) 
x2    9.99*** 9.98***
      (0.02)  (0.03) 
N     100     100    
R2    1.00    1.00   
=====================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
"""
which should be what you wanted to get.
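A slightly more robust variant (my own suggestion, assuming the unwanted rows keep the labels 'R-squared' and 'R-squared Adj.' shown above) is to drop them by label instead of by integer position, so the code does not break if the number of regressors changes:
# Drop the rows by label rather than by position (a sketch, not a statsmodels API).
drop_labels = ['R-squared', 'R-squared Adj.']
out_table.tables = [
    t.drop(index=[lbl for lbl in drop_labels if lbl in t.index])
    for t in out_table.tables
]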

Related

Quantile residual Q-Q plot in python

I know how to get normal Q-Q plots in Python, but how can I get quantile residual Q-Q plots?
I tried to do the three steps written here (Chapter 20.2.6.1):
First I tried to adapt this solution for use with smf.glm (I need to use smf because I have a huge dataframe with hundreds of variables I need to pass):
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

# generate some data to check
nobs = 1000
n, p = 50, 0.25
dist0 = stats.nbinom(n, p)
y = dist0.rvs(size=nobs)
x = np.ones(nobs)
df_test = pd.DataFrame({'y': y, 'x': x})

loglike_method = 'nb2'  # or use 'nb1'

# res = sm.NegativeBinomial(y, x, loglike_method=loglike_method).fit(start_params=[0.1, 0.1])
res = smf.glm(formula="y ~ x", data=df_test, family=sm.families.NegativeBinomial()).fit(start_params=[0.1, 0.1])

print(dist0.mean())
print(res.params)

mu = res.predict()  # use this for mean if not constant
mu = mu.mean()
# mu = np.exp(res.params[0])  # shortcut, we just regress on a constant

alpha = res.params[0]

if loglike_method == 'nb1':
    Q = 1
elif loglike_method == 'nb2':
    Q = 0

size = 1. / alpha * mu**Q
prob = size / (size + mu)
print('data generating parameters: n={}, p={}'.format(n, p))
print('estimated params: size={}, prob={}'.format(size, prob))

# estimated distribution
dist_est = stats.nbinom(size, prob)
But the estimated parameters are totally off.
Next step would be to call stats.nbinom.cdf with those parameters to simulate values ...
Is this the right way?
And how can I get the correct values for size and prob from my fitted model?
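As for the next step, here is a minimal sketch of how randomised quantile residuals (in the Dunn-Smyth sense) are usually formed for a discrete fit. This is my own illustration of the general recipe, not the book's code, and it assumes the size, prob and y from the snippet above are correct:
# Sketch: randomised quantile residuals for a discrete fitted distribution.
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(0)
F_hi = stats.nbinom.cdf(y, size, prob)      # fitted CDF at the observed counts
F_lo = stats.nbinom.cdf(y - 1, size, prob)  # fitted CDF just below (0 when y == 0)
u = rng.uniform(F_lo, F_hi)                 # randomise within the jump of the CDF
quantile_resid = stats.norm.ppf(u)          # map to the standard normal scale

sm.qqplot(quantile_resid, line='45')        # then a plain normal Q-Q plot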

Sweeping a parameter for an ODE function python

This isn't an ODE question, per se. It's more of a referencing issue, I think (but I might be wrong). I've copied the code below. I can't seem to sweep parameters for an ODE function that I would like to run. Any advice/insight on how to fix this problem would be greatly appreciated! Thank you very much in advance!
CONTEXTUAL CODE:
import numpy as np
from scipy.integrate import odeint  # odeint allows to run ODEs
import matplotlib.pyplot as plt

def model(x, t):
    # Definitions. The 'x' input is of the form [B0, G0, S0, F0]. So they need to be allocated to the appropriate variable here.
    B = x[0]
    G = x[1]
    S = x[2]
    F = x[3]

    # ODEs.
    dBdt = l - d*G*(B/(B+K0)) - aB*B
    dGdt = pG1*(B**n/((B**n)+(K1**n))) + pG2*(S**n/((S**n)+(K2**n))) - aG*G
    dSdt = pS*((F**n)/((F**n)+(K4**n))) - aS*S
    dFdt = pF*((K3**n)/((K3**n)+(B**n))) - aF*F

    return [dBdt, dGdt, dSdt, dFdt]

# Parameters for 'model'
# This is a list of the parameters that are used in the 'model' function above.
pG1 = 0.25
pG2 = 0.25
pF = 0.25
pS = 0.25
aB = 0.25
aG = 0.25
aF = 0.25
aS = 0.25
K0 = 0.4
K1 = 0.5
K2 = 0.3
K3 = 0.45
K4 = 0.35
n = 3
l = 0.25
d = 1.5
n1 = 3
n2 = 3
n3 = 3
n4 = 3

# Initial conditions for the ODE
model_x0 = [1, 0, 0, 0]  # this will be entered as an input to the 'model' function

# Defining the timeline of the model
model_t = np.linspace(0, 50, 200)

def sweep(param, p_low, p_high, values):
    B = np.array([])
    parameter_values = np.linspace(p_low, p_high, values)
    for parameter_value in parameter_values:
        param = parameter_value  # **THIS IS THE KEY SECTION, I THINK. 'param' isn't referencing the variable that is being given in the argument of the call**
        model_result = odeint(model, model_x0, model_t)
        temp = np.array(model_result[:, 0])
        B = np.append(B, temp, axis=0)
    return tuple(B)
When I test the sweep with two values for 'pG1' (they should give different outputs):
test = sweep(pG1, 0, 0.8, 2)
test1 = test[:200]
test2 = test[200:]
test1==test2
This outputs True. And it shouldn't.
Out[10]: True
Based on my understanding, you basically want to sweep a variable, in this scenario pG1, through your ODE. The primary mistake is that the ODE never receives the new values. odeint allows odeint(model, model_init, t, args=(a, b, c)) according to the docs. Since you are initializing the parameters globally, reassigning param inside sweep doesn't actually change anything the ODE sees. I am not an expert at this, but I got a working version with some changes to your code. Pretty sure there is a more elegant way of doing this, which I hope someone can contribute.
import numpy as np
from scipy.integrate import odeint  # odeint allows to run ODEs
import matplotlib.pyplot as plt

def model(x, t, param_value):  # we modify the ode model slightly to allow the passing of parameters.
    pG1 = param_value['pG1']  # since it's a dict we can access the parameter like this. I used a dict because if you want to sweep the other parameters too, it's much easier this way I think.

    # Definitions. The 'x' input is of the form [B0, G0, S0, F0]. So they need to be allocated to the appropriate variable here.
    B = x[0]
    G = x[1]
    S = x[2]
    F = x[3]

    # ODEs.
    dBdt = l - d*G*(B/(B+K0)) - aB*B
    dGdt = pG1*(B**n/((B**n)+(K1**n))) + pG2*(S**n/((S**n)+(K2**n))) - aG*G
    dSdt = pS*((F**n)/((F**n)+(K4**n))) - aS*S
    dFdt = pF*((K3**n)/((K3**n)+(B**n))) - aF*F
    # print(pG1)  # debugged here to see if the value actually changed

    return [dBdt, dGdt, dSdt, dFdt]

# Parameters for 'model'
# This is a list of the parameters that are used in the 'model' function above.
pG1 = 0.25  # Defining the parameters like this defines them once. They never change in your original code: model reads the values from here, but nothing ever updates them.
pG2 = 0.25
pF = 0.25
pS = 0.25
aB = 0.25
aG = 0.25
aF = 0.25
aS = 0.25
K0 = 0.4
K1 = 0.5
K2 = 0.3
K3 = 0.45
K4 = 0.35
n = 3
l = 0.25
d = 1.5
n1 = 3
n2 = 3
n3 = 3
n4 = 3

# Here we put all your parameters into a nice dict.
param = {'pG1': pG1, 'pG2': pG2, 'pF': pF, 'pS': pS, 'aB': aB, 'aG': aG, 'aF': aF, 'aS': aS,
         'K0': K0, 'K1': K1, 'K2': K2, 'K3': K3, 'K4': K4,
         'n': n, 'l': l, 'd': d, 'n1': n1, 'n2': n2, 'n3': n3, 'n4': n4}

# Initial conditions for the ODE
model_x0 = [1, 0, 0, 0]  # this will be entered as an input to the 'model' function

# Defining the timeline of the model
model_t = np.linspace(0, 50, 200)

def sweep(p_name, p_low, p_high, values, param):  # note I changed the input name: we pass the name of the parameter we want to sweep.
    B = np.array([])
    parameter_values = np.linspace(p_low, p_high, values)
    for parameter_value in parameter_values:
        param[p_name] = parameter_value  # Here we use the linspace values to 'sweep' the parameter value that we are manipulating
        model_result = odeint(model, model_x0, model_t, args=(param,))  # here we pass the 'new' parameters into the actual ode.
        temp = np.array(model_result[:, 0])
        B = np.append(B, temp, axis=0)
    return tuple(B)

test = sweep('pG1', 0, 0.8, 2, param)  # so we pass the 'default' parameters
test1 = test[:200]
test2 = test[200:]
print(test1 == test2)
>>False
This is definitely a hacky way to do it but it works. I am sure someone with more experience using odeint and the numpy/scipy package can give you an easier/cleaner way to do this. You can extend this for all the parameters if you so wish. I would not recommend globals however.
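If it helps, here is one way a slightly cleaner version could look. This is my own sketch building on the answer above (it reuses the model, model_x0, model_t and param objects defined there): build a fresh parameter dict per value instead of mutating the shared one, and collect the trajectories in a 2-D array rather than a flat tuple.
# Sketch only: assumes model(x, t, param_value), model_x0, model_t and param from the answer above.
import numpy as np
from scipy.integrate import odeint

def sweep_clean(p_name, p_low, p_high, values, base_params):
    results = []
    for value in np.linspace(p_low, p_high, values):
        params = {**base_params, p_name: value}  # copy, don't mutate the caller's dict
        sol = odeint(model, model_x0, model_t, args=(params,))
        results.append(sol[:, 0])                # keep only B(t)
    return np.array(results)                     # shape: (values, len(model_t))

B_sweep = sweep_clean('pG1', 0, 0.8, 2, param)
print(np.allclose(B_sweep[0], B_sweep[1]))       # expected to print False, matching the answer above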

Maximize objective using scipy (Kelly criterion)

I have the following two pandas dataframes: new & outcome
new = pd.DataFrame([[5,5,1.6],[0.22,0.22,0.56]]).T
new.index = ['Visitor','Draw','Home']
new.columns = ['Decimal odds', 'Win prob']
new['Bet amount'] = np.zeros((len(new),1))
With output:
         Decimal odds  Win prob  Bet amount
Visitor           5.0      0.22         0.0
Draw              5.0      0.22         0.0
Home              1.6      0.56         0.0
And dataframe 'outcome'
outcome = pd.DataFrame([[0.22,0.22,0.56],[100,100,100]]).T
outcome.index = ['Visitor win','Draw','Home win']
outcome.columns = ['Prob.','Starting bankroll']
outcome['Wins'] = ((new['Decimal odds'] - 1) * new['Bet amount']).values
outcome['Losses'] = [sum(new['Bet amount'][[1,2]]) , sum(new['Bet amount'][[0,2]]), sum(new['Bet amount'][[0,1]])]
outcome['Ending bankroll'] = outcome['Starting bankroll'] + outcome['Wins'] - outcome['Losses']
outcome['Logarithm'] = np.log(outcome['Ending bankroll'])
With output:
             Prob.  Starting bankroll  Wins  Losses  Ending bankroll  Logarithm
Visitor win   0.22              100.0   0.0     0.0            100.0    4.60517
Draw          0.22              100.0   0.0     0.0            100.0    4.60517
Home win      0.56              100.0   0.0     0.0            100.0    4.60517
The objective is then calculated by the formula below:
objective = sum(outcome['Prob.'] * outcome['Logarithm'])
Now I want to maximize the objective over the values contained in the column `new['Bet amount']`. The constraints are that a, b, and c are each bounded between 0 and 100, and that the sum of a, b and c must stay below 100. The reason is that a, b and c represent the fraction of your bankroll used to place a sports bet.
I want to achieve this using the scipy library. My code so far looks like:
from scipy.optimize import minimize

prob = new['Win prob']
decimal = new['Decimal odds']
bank = outcome['Starting bankroll'][0]

def constraint1(bet):
    a, b, c = bet
    return 100 - a + b + c

con1 = {'type': 'ineq', 'fun': constraint1}
cons = [con1]

b0, b1, b2 = (0, 100), (0, 100), (0, 100)
bnds = (b0, b1, b2)

def f(bet, sign=-1):
    global prob, decimal, bank
    p0, p1, p2 = prob
    d0, d1, d2 = decimal
    a, b, c = bet

    wins0 = a * (d0 - 1)
    wins1 = b * (d1 - 1)
    wins2 = c * (d2 - 1)

    loss0 = b + c
    loss1 = a + c
    loss2 = a + b

    log0 = np.log(bank + wins0 - loss0)
    log1 = np.log(bank + wins1 - loss1)
    log2 = np.log(bank + wins2 - loss2)

    objective = (log0 * p0 + log1 * p1 + log2 * p2)
    return sign * objective

bet = [5, 8, 7]
result = minimize(f, bet, method='SLSQP', bounds=bnds, constraints=cons)
This, however, does not produce the desired result. The desired result would be:
a = 3.33
b = 3.33
c = 0
My question is also how to choose the method and the initial values, since the results seem to differ a lot for different methods and different initial values of the bets.
Any help would be greatly appreciated!
(This is an example posted on the pinnacle website: https://www.pinnacle.com/en/betting-articles/Betting-Strategy/the-real-kelly-criterion/HZKJTFCB3KNYN9CJ)
If you print out the "bet" values inside your function, you can see where it's going wrong.
[5. 8. 7.]
[5.00000001 8. 7. ]
[5. 8.00000001 7. ]
[5. 8. 7.00000001]
[5.00040728 7.9990977 6.99975556]
[5.00040729 7.9990977 6.99975556]
[5.00040728 7.99909772 6.99975556]
[5.00040728 7.9990977 6.99975558]
[5.00244218 7.99458802 6.99853367]
[5.0024422 7.99458802 6.99853367]
The algorithm is trying to optimize the formula with very small adjustments relative to your initial values, and it never adjusts enough to get to the values you're looking for.
If you check the scipy documentation (https://docs.scipy.org/doc/scipy/reference/optimize.minimize-slsqp.html#optimize-minimize-slsqp), you find:
eps : float
    Step size used for numerical approximation of the Jacobian.
and the default options are:
result = minimize(f, bet, method='SLSQP', bounds=bnds, constraints=cons,
                  options={'maxiter': 100, 'ftol': 1e-06, 'iprint': 1, 'disp': True,
                           'eps': 1.4901161193847656e-08, 'finite_diff_rel_step': None})
So you're starting off with a step size of roughly 1.5e-08, and your initial estimates are many orders of magnitude outside the range where the algorithm is going to be looking.
I'd recommend normalizing your bets to values between zero and 1. So instead of saying I'm placing a bet between 0 and 100, just say you're wagering a fraction of your net wealth between 0 and 1. A lot of algorithms are designed to work with standardized inputs (between 0 and 1) or normalized inputs (standard deviations from the mean).
Also, it looks like:
def constraint1(bet):
    a, b, c = bet
    return 100 - a + b + c
should be:
def constraint1(bet):
    a, b, c = bet
    return 100 - (a + b + c)
but I don't think that impacts your results.
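A minimal sketch of the normalised formulation suggested above (my own illustration, not code from the question): treat the bets as bankroll fractions in [0, 1], use the corrected constraint, and convert back to currency at the end.
# Sketch: same objective, but the decision variables are bankroll fractions in [0, 1].
# Assumes prob, decimal and bank are defined as in the question.
import numpy as np
from scipy.optimize import minimize

def neg_expected_log_growth(frac):
    a, b, c = frac                      # fractions of the bankroll staked on each outcome
    p = prob.values
    d = decimal.values
    wins = np.array([a * (d[0] - 1), b * (d[1] - 1), c * (d[2] - 1)])
    losses = np.array([b + c, a + c, a + b])
    ending = 1.0 + wins - losses        # bankroll normalised to 1
    return -np.dot(p, np.log(ending))   # minimise the negative expected log growth

cons = [{'type': 'ineq', 'fun': lambda frac: 1.0 - np.sum(frac)}]  # a + b + c <= 1
bnds = [(0, 1)] * 3
res = minimize(neg_expected_log_growth, x0=[0.05, 0.05, 0.05],
               method='SLSQP', bounds=bnds, constraints=cons)
print(res.x * bank)  # convert the optimal fractions back to currency amounts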

Knowledge transfer in regularised linear regression

By default all regularised linear regression techniques of scikit-learn pull the model coefficients w towards 0 with increased alpha. Is it possible to instead pull the coefficients towards some predefined values? In my application I do have such values that have been obtained from a previous analysis of a similar but much larger dataset. In other words, can I transfer the knowledge from one model to another?
The documentation of LassoCV says:
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
In theory it's easy to incorporate previously obtained coefficients w0 by changing the above to
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w - w0||_1
The problem is that the actual optimisation is carried out by the Cython function enet_coordinate_descent (called via lasso_path and enet_path). If I want to change it, do I need to fork, modify, and recompile the whole sklearn.linear_model package or reimplement the whole optimisation routine?
Toy example
The following code defines a dataset X with 4 features and a matching response vector y.
import numpy as np
from sklearn.linear_model import LassoCV
n = 50
x1 = np.random.normal(10, 8, n)
x2 = np.random.normal(8, 6, n)
X = np.column_stack([x1, x1 ** 2, x2, x2 ** 2])
y = .8 * x1 + .2 * x2 + .7 * x2**2 + np.random.normal(0, 3, n)
cv = LassoCV(cv=10).fit(X, y)
The resulting coefficients and alpha are
>>> print(cv.coef_)
[ 0.46262115 0.01245427 0. 0.70642803]
>>> print(cv.alpha_)
7.63613474003
If we had prior knowledge regarding two of the coefficients w0 = np.array([.8, 0, .2, 0]), how could that be incorporated?
My final solution, based on @lejlot's answer
Rather than using vanilla GD I eventually arrived at using Adam.
This solution just fits a lasso for a given value of alpha; it does not find alpha by itself the way LassoCV does (but it's easy to add a layer of CV on top of it).
from autograd import numpy as np
from autograd import grad
from autograd.optimizers import adam

def fit_lasso(X, y, alpha=0, W0=None):
    if W0 is None:
        W0 = np.zeros(X.shape[1])

    def l1_loss(W, i):
        # i is only used for compatibility with adam
        return np.mean((np.dot(X, W) - y) ** 2) + alpha * np.sum(np.abs(W - W0))

    gradient = grad(l1_loss)

    def print_w(w, i, g):
        if (i + 1) % 250 == 0:
            print("After %i step: w = %s" % (i + 1, np.array2string(w.T)))

    W_init = np.random.normal(size=(X.shape[1], 1))
    W = adam(gradient, W_init, step_size=.1, num_iters=1000, callback=print_w)
    return W
n = 50
x1 = np.random.normal(10, 8, n)
x2 = np.random.normal(8, 6, n)
X = np.column_stack([x1, x1 ** 2, x2, x2 ** 2])
y = .8 * x1 + .2 * x2 + .7 * x2 ** 2 + np.random.normal(0, 3, n)
fit_lasso(X, y, alpha=30)
fit_lasso(X, y, alpha=30, W0=np.array([.8, 0, .2, 0]))
After 250 step: w = [[ 0.886 0.131 0.005 0.291]]
After 500 step: w = [[ 0.886 0.131 0.003 0.291]]
After 750 step: w = [[ 0.886 0.131 0.013 0.291]]
After 1000 step: w = [[ 0.887 0.131 0.013 0.292]]
After 250 step: w = [[ 0.868 0.129 0.728 0.247]]
After 500 step: w = [[ 0.803 0.132 0.717 0.249]]
After 750 step: w = [[ 0.801 0.132 0.714 0.249]]
After 1000 step: w = [[ 0.801 0.132 0.714 0.249]]
The results are quite similar on this example, but you can at least tell that specifying a W0 prevented the model from killing the third coefficient.
The effect is only apparent if you use an alpha > 20 or thereabouts.
In short: yes, you would need to do it by hand, modifying and recompiling everything. Scikit-learn is not a library for customizable ML models; it is about providing simple, typical models with an easy-to-use interface. If you want customization you should look at things like tensorflow, keras etc., or at least autograd. In fact, with autograd this is extremely simple, since you can write your code with numpy and use autograd to compute the gradients.
X = ...      # your data
y = ...      # your targets
W0 = ...     # target weights
alpha = ...  # pulling strength
lr = ...     # learning rate (step size of gradient descent)

from autograd import numpy as np
from autograd import grad

def your_loss(W):
    return np.mean((np.dot(X, W) - y) ** 2) + alpha * np.sum(np.abs(W - W0))

g = grad(your_loss)

W = np.random.normal(size=(X.shape[1], 1))
for i in range(100):
    W = W - lr * g(W)

print(W)
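A side note (my own observation, not taken from either answer): for the specific objective written in the question, substituting v = w - w0 turns it into a standard Lasso on the shifted target y - X·w0; fitting that without an intercept and adding w0 back recovers w. A sketch on the toy data above:
# Change-of-variables sketch: min (1/(2n))||y - Xw||^2 + alpha * ||w - w0||_1
# solved by fitting a plain Lasso to (X, y - X @ w0) and shifting the result back.
import numpy as np
from sklearn.linear_model import Lasso

w0 = np.array([.8, 0, .2, 0])
alpha = 7.6                      # roughly the alpha found by LassoCV above, for illustration
lasso = Lasso(alpha=alpha, fit_intercept=False).fit(X, y - X @ w0)
w_hat = lasso.coef_ + w0         # coefficients shrunk towards w0 instead of towards 0
print(w_hat)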

How to calculate the Kolmogorov-Smirnov statistic between two weighted samples

Let's say that we have two samples data1 and data2 with their respective weights weight1 and weight2 and that we want to calculate the Kolmogorov-Smirnov statistic between the two weighted samples.
The way we do that in python follows:
import numpy as np

def ks_w(data1, data2, wei1, wei2):
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    data1 = data1[ix1]
    data2 = data2[ix2]
    d = 0.
    fn1 = 0.
    fn2 = 0.
    j1 = 0
    j2 = 0
    j1w = 0.
    j2w = 0.
    while (j1 < len(data1)) & (j2 < len(data2)):
        d1 = data1[j1]
        d2 = data2[j2]
        w1 = wei1[j1]
        w2 = wei2[j2]
        if d1 <= d2:
            j1 += 1
            j1w += w1
            fn1 = (j1w) / sum(wei1)
        if d2 <= d1:
            j2 += 1
            j2w += w2
            fn2 = (j2w) / sum(wei2)
        if abs(fn2 - fn1) > d:
            d = abs(fn2 - fn1)
    return d
where we simply adapt to our purpose the classical two-sample KS statistic as implemented in Press, Flannery, Teukolsky, Vetterling, Numerical Recipes in C, Cambridge University Press, 1992, p. 626.
Our questions are:
is anybody aware of any other way to do it?
is there any library in python/R/* that performs it?
what about the test? Does it exist or should we use a reshuffling procedure in order to evaluate the statistic?
This solution is based on the code for scipy.stats.ks_2samp and runs in about 1/10000 the time (notebook):
import numpy as np

def ks_w2(data1, data2, wei1, wei2):
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    data1 = data1[ix1]
    data2 = data2[ix2]
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    data = np.concatenate([data1, data2])
    cwei1 = np.hstack([0, np.cumsum(wei1) / sum(wei1)])
    cwei2 = np.hstack([0, np.cumsum(wei2) / sum(wei2)])
    cdf1we = cwei1[np.searchsorted(data1, data, side='right')]
    cdf2we = cwei2[np.searchsorted(data2, data, side='right')]
    return np.max(np.abs(cdf1we - cdf2we))
Here's a test of its accuracy and performance:
ds1 = np.random.rand(10000)
ds2 = np.random.randn(40000) + .2
we1 = np.random.rand(10000) + 1.
we2 = np.random.rand(40000) + 1.
ks_w2(ds1, ds2, we1, we2)
# 0.4210415232236593
ks_w(ds1, ds2, we1, we2)
# 0.4210415232236593
%timeit ks_w2(ds1, ds2, we1, we2)
# 100 loops, best of 3: 17.1 ms per loop
%timeit ks_w(ds1, ds2, we1, we2)
# 1 loop, best of 3: 3min 44s per loop
To add to Luca Jokull's answer, if you want to also return a p-value (similar to the unweighted scipy.stats.ks_2samp function), the suggested ks_w2() function can be modified as follows:
import numpy as np
from scipy.stats import distributions

def ks_weighted(data1, data2, wei1, wei2, alternative='two-sided'):
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    data1 = data1[ix1]
    data2 = data2[ix2]
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    data = np.concatenate([data1, data2])
    cwei1 = np.hstack([0, np.cumsum(wei1) / sum(wei1)])
    cwei2 = np.hstack([0, np.cumsum(wei2) / sum(wei2)])
    cdf1we = cwei1[np.searchsorted(data1, data, side='right')]
    cdf2we = cwei2[np.searchsorted(data2, data, side='right')]
    d = np.max(np.abs(cdf1we - cdf2we))
    # calculate p-value
    n1 = data1.shape[0]
    n2 = data2.shape[0]
    m, n = sorted([float(n1), float(n2)], reverse=True)
    en = m * n / (m + n)
    if alternative == 'two-sided':
        prob = distributions.kstwo.sf(d, np.round(en))
    else:
        z = np.sqrt(en) * d
        # Use Hodges' suggested approximation Eqn 5.3
        # Requires m to be the larger of (n1, n2)
        expt = -2 * z**2 - 2 * z * (m + 2*n) / np.sqrt(m*n*(m+n)) / 3.0
        prob = np.exp(expt)
    return d, prob
This is the asymptotic method that scipy's original unweighted function uses.
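For example, reusing the ds1, ds2, we1, we2 arrays from the earlier timing test, a call would look like this (the statistic should match the ks_w2 value above; note the p-value uses the raw sample sizes as effective sample sizes, which is only an approximation for weighted data):
# Usage sketch with the test arrays defined earlier.
d_stat, p_value = ks_weighted(ds1, ds2, we1, we2)
print(d_stat, p_value)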
This is an R version of a two-tailed weighted KS statistic, following the suggestion in Numerical Methods of Statistics by Monahan (p. 334 in the 1st edition, p. 358 in the 2nd edition). The ewcdf function used below comes from the spatstat package.
ks_weighted <- function(vector_1, vector_2, weights_1, weights_2){
    F_vec_1 <- ewcdf(vector_1, weights = weights_1, normalise = FALSE)
    F_vec_2 <- ewcdf(vector_2, weights = weights_2, normalise = FALSE)
    xw <- c(vector_1, vector_2)
    d <- max(abs(F_vec_1(xw) - F_vec_2(xw)))

    ## P-VALUE with NORMAL SAMPLE
    # n_vector_1 <- length(vector_1)
    # n_vector_2 <- length(vector_2)
    # n <- n_vector_1 * n_vector_2 / (n_vector_1 + n_vector_2)

    # P-VALUE with EFFECTIVE SAMPLE SIZE as suggested by Monahan
    n_vector_1 <- sum(weights_1)^2 / sum(weights_1^2)
    n_vector_2 <- sum(weights_2)^2 / sum(weights_2^2)
    n <- n_vector_1 * n_vector_2 / (n_vector_1 + n_vector_2)

    pkstwo <- function(x, tol = 1e-06) {
        if (is.numeric(x))
            x <- as.double(x)
        else stop("argument 'x' must be numeric")
        p <- rep(0, length(x))
        p[is.na(x)] <- NA
        IND <- which(!is.na(x) & (x > 0))
        if (length(IND))
            p[IND] <- .Call(stats:::C_pKS2, p = x[IND], tol)
        p
    }

    pval <- 1 - pkstwo(sqrt(n) * d)

    out <- c(KS_Stat = d, P_value = pval)
    return(out)
}
