Trying to use statsmodels.glm with constraints and unable to get (what I think should be) a simple requirement to work...
import numpy as np
import pandas as pd
from statsmodels.tools import add_constant
from statsmodels.formula.api import glm
test = np.array([[10,29,30],[32,26,23],[34,39,46]])
exog = add_constant(test)
y = np.array([11,23,27])
formula = 'y ~ x + q'
formula_df = formula_df = pd.DataFrame({'y' : y,'x':test[:,0],'q':test[:,1],'r':test[:,2],'int':exog[:,3]})
glmm = glm(formula=formula,data=formula_df)
glmm_results = glmm.fit_constrained(constraints='q = 0')
print(glmm_results.summary())
from the docs for fit constraints:
The constraints are of the form R params = q where R is the constraint_matrix and q is the vector of constraint_values.
The estimation creates a new model with transformed design matrix, exog, and converts the results back to the original parameterization.
Parameters
constraintsformula expression or tuple
If it is a tuple, then the constraint needs to be given by two arrays (constraint_matrix, constraint_value), i.e. (R, q). Otherwise, the constraints can be given as strings or list of strings. see t_test for details
however, if I try to simply ensure that q is > 0 by changing the constraint to:
glm_results = glmm.fit_constrained(constraints='q > 0')
then i get an unrecognized token error
PatsyError: unrecognized token in constraint
q > 0
^
I've tried '>>' as well. There isn't much documentation I can find for statsmodels and writing constraints beyond what I copy/pasted above. How do I get this to work?
An alternate question would be how do I write the 4 simplest types of constraints on coefficients in the q,R format (x = 0, x<0, x>0, a<x<b)?
Related
I am using GEKKO for solving an MINLP problem in Python that involves a co-simulation with PowerFactory, a power system simulation software. The MINLP problem objective function needs to call PowerFactory from within and execute particular tasks before returning the values to the MILP problem definition. In the equality constraint definition, I need to also use two Pandas data frames (tables with 10000x2 cells) to get particular values corresponding to the values of the decision variables in the constraint expression. But, I am getting errors while writing any code involving loops, or Pandas 'loc' or 'iloc' functions, within the objective/constraint functions due to some issues with the nature of GEKKO variables in these function calls. Any help in this regard would be much appreciated. The structure of the code is given below:
import powerfactory as pf
from gekko import GEKKO
import pandas as pd
# Execute the PF setup commands for Python
# Pandas dataframe in question
df1 = pd.read_csv("table1.csv")
df2 = pd.read_csv("table2.csv")
def constraint1(X):
P = 0
for i in range(length(X)):
V = X[i] * 1000
D = round(V, 1)
P += df1.loc[D, "ColumnName"]
return P
def objective(X):
for i in range(length(X)):
V = X[i] * 1000
D = round(V, 1)
I = df2.loc[D, "ColumnName"]
# A list with the different values of 'I' is generated for passing to PF
# Some PowerFactory based commands below, involving inputs, and extracting results from PF, using a list generated from the values of 'I'. PF returns some result values to the code
return results
# MINLP problem definition:
m = GEKKO(remote=False)
x = m.Array(m.Var, nInv, value=1.0, lb=0.533, ub=1.0)
m.Equation(constraint1(x) == 30)
m.Minimize(objective(x))
m.options.SOLVER = 3
m.solve(disp=False)
# Command for exporting the results to a .txt file
In another problem formulation, I am trying to run MINLP optimization problems within the objective and constraint function as a nested optimization problem. However, I am running into errors in that as well. The structure of the code is as follows:
import powerfactory as pf
from gekko import GEKKO
# Execute the PF setup commands for Python
def constraint1(X):
P = 0
for i in range(length(X)):
V = X[i] * 1000
# 2nd MINLP problem: Finds 'I' from value of 'V', a single element
# Calculate 'Pcal' from 'I'
P += Pcal
return P
def objective(X):
Iset = []
for i in range(length(X)):
V = X[i] * 1000
# 3rd MINLP problem: Finds 'I' from value of 'V'
# Appends 'I' to a 'Iset'
# 'Iset' list passed on to PF
# Some PowerFactory based commands below, involving inputs, and extracting results from PF, using the 'Iset' list. PF returns some result values to the code
return results
# Main MINLP problem definition:
m = GEKKO(remote=False)
x = m.Array(m.Var, nInv, value=1.0, lb=0.533, ub=1.0)
m.Equation(constraint1(x) == 30)
m.Minimize(objective(x))
m.options.SOLVER = 3
m.solve(disp=False)
# Command for exporting the results to a .txt file
Gekko does not have call-backs to external functions. This is because the solvers are gradient-based and a precursor steps is automatic differentiation to obtain exact 1st and 2nd derivatives in sparse form. Similar to Gekko and CoolProp, one option is to build an approximation to your external (in this case PowerFactory) model that the optimizer can use. Two options are the cubic spline (cspline) or the basis spline (bspline).
If you can't use these approximations then I recommend switching to a different platform for solving the optimization problem. Perhaps you could try scipy.optimize.minimize that can obtain gradients by finite difference and add branch and bound to solve the mixed integer portion.
I am trying to cluster 2 distributions with K-Means and using cosine similarity as a metric to define similarity. I wrote the following code. But it gives an error saying:
Error: no centroid defined for empty cluster.
Try setting argument 'avoid_empty_clusters' to True
I could not understand the reason for this. I need to create 2 clusters.
import numpy as np
from nltk.cluster.kmeans import KMeansClusterer
import nltk as nltk
np.random.seed(1)
distr_1 = np.sin(2 * np.random.randn(100) + np.random.randn())
distr_2 = (3 * np.random.randn(100)) + np.random.randn()
x = list(range(0,100))
X_train = np.concatenate((distr_1, distr_2))
X_train = X_train.reshape(200,1)
kclusterer = KMeansClusterer(2, distance=nltk.cluster.util.cosine_distance, repeats=100)
assigned_clusters = kclusterer.cluster(X_train.flatten(), assign_clusters=True)
print(assigned_clusters)
Because of bad starting conditions it can happen that a cluster becomes empty.
Your code needs to handle this exceptional case.
I am trying to solve an equation in Python. Basically what I want to do is to solve the equation:
(1/x^2)*d(Gam*dL/dx)/dx)+(a^2*x^2/Gam-(m^2))*L=0
This is the Klein-Gordon equation for a massive scalar field in a Schwarzschild spacetime. It suppose that we know m and Gam=x^2-2*x. The initial/boundary condition that I know are L(2+epsilon)=1 and L(infty)=0. Notice that the asymptotic behavior of the equation is
L(x-->infty)-->Exp[(m^2-a^2)*x]/x and Exp[-(m^2-a^2)*x]/x
Then, if a^2>m^2 we will have oscillatory solutions, while if a^2 < m^2 we will have a divergent and a decay solution.
What I am interested is in the decay solution, however when I am trying to solve the above equation transforming it as a system of first order differential equations and using the shooting method in order to find the "a" that can give me the behavior that I am interested about, I am always having a divergent solution. I suppose that it is happening because odeint is always finding the divergent asymptotic solution. Is there a way to avoid or tell to odeint that I am interested in the decay solution? If not, do you know a way that I could solve this problem? Maybe using another method for solving my system of differential equations? If yes, which method?
Basically what I am doing is to add a new system of equation for "a"
(d^2a/dx^2=0, da/dx(2+epsilon)=0,a(2+epsilon)=a_0)
in order to have "a" as a constant. Then I am considering different values for "a_0" and asking if my boundary conditions are fulfilled.
Thanks for your time. Regards,
Luis P.
I am incorporating the value at infinity considering the assimptotic behavior, it means that I will have a relation between the field and its derivative. I will post the code for you if it is helpful:
from IPython import get_ipython
get_ipython().magic('reset -sf')
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from math import *
from scipy.integrate import ode
These are initial conditions for Schwarzschild. The field is invariant under reescaling, then I can use $L(2+\epsilon)=1$
def init_sch(u_sch):
om = u_sch[0]
return np.array([1,0,om,0]) #conditions near the horizon, [L_c,dL/dx,a,da/dx]
These are our system of equations
def F_sch(IC,r,rho_c,m,lam,l,j=0,mu=0):
L = IC[0]
ph = IC[1]
om = IC[2]
b = IC[3]
Gam_sch=r**2.-2.*r
dR_dr = ph
dph_dr = (1./Gam_sch)*(2.*(1.-r)*ph+L*(l*(l+1.))-om**2.*r**4.*L/Gam_sch+(m**2.+lam*L**2.)*r**2.*L)
dom_dr = b
db_dr = 0.
return [dR_dr,dph_dr,dom_dr,db_dr]
Then I try for different values of "om" and ask if my boundary conditions are fulfilled. p_sch are the parameters of my model. In general what I want to do is a little more complicated and in general I will need more parameters that in the just massive case. Howeve I need to start with the easiest which is what I am asking here
p_sch = (1,1,0,0) #[rho_c,m,lam,l], lam and l are for a more complicated case
ep = 0.2
ep_r = 0.01
r_end = 500
n_r = 500000
n_omega = 1000
omega = np.linspace(p_sch[1]-ep,p_sch[1],n_omega)
r = np.linspace(2+ep_r,r_end,n_r)
tol = 0.01
a = 0
for j in range(len(omega)):
print('trying with $omega =$',omega[j])
omeg = [omega[j]]
ini = init_sch(omeg)
Y = odeint(F_sch,ini,r,p_sch,mxstep=50000000)
print Y[-1,0]
#Here I ask if my asymptotic behavior is fulfilled or not. This should be basically my value at infinity
if abs(Y[-1,0]*((p_sch[1]**2.-Y[-1,2]**2.)**(1/2.)+1./(r[-1]))+Y[-1,1]) < tol:
print(j,'times iterations in omega')
print("R'(inf)) = ", Y[-1,0])
print("\omega",omega[j])
omega_1 = [omega[j]]
a = 10
break
if a > 1:
break
Basically what I want to do here is to solve the system of equations giving different initial conditions and find a value for "a=" (or "om" in the code) that should be near to my boundary conditions. I need this because after this I can give such initial guest to a secant method and try to fiend a best value for "a". However, always that I am running this code I am having divergent solutions that it is, of course, a behavior that I am not interested. I am trying the same but considering the scipy.integrate.solve_vbp, but when I run the following code:
from IPython import get_ipython
get_ipython().magic('reset -sf')
import numpy as np
import matplotlib.pyplot as plt
from math import *
from scipy.integrate import solve_bvp
def bc(ya,yb,p_sch):
m = p_sch[1]
om = p_sch[4]
tol_s = p_sch[5]
r_end = p_sch[6]
return np.array([ya[0]-1,yb[0]-tol_s,ya[1],yb[1]+((m**2-yb[2]**2)**(1/2)+1/r_end)*yb[0],ya[2]-om,yb[2]-om,ya[3],yb[3]])
def fun(r,y,p_sch):
rho_c = p_sch[0]
m = p_sch[1]
lam = p_sch[2]
l = p_sch[3]
L = y[0]
ph = y[1]
om = y[2]
b = y[3]
Gam_sch=r**2.-2.*r
dR_dr = ph
dph_dr = (1./Gam_sch)*(2.*(1.-r)*ph+L*(l*(l+1.))-om**2.*r**4.*L/Gam_sch+(m**2.+lam*L**2.)*r**2.*L)
dom_dr = b
db_dr = 0.*y[3]
return np.vstack((dR_dr,dph_dr,dom_dr,db_dr))
eps_r=0.01
r_end = 500
n_r = 50000
r = np.linspace(2+eps_r,r_end,n_r)
y = np.zeros((4,r.size))
y[0]=1
tol_s = 0.0001
p_sch= (1,1,0,0,0.8,tol_s,r_end)
sol = solve_bvp(fun,bc, r, y, p_sch)
I am obtaining this error: ValueError: bc return is expected to have shape (11,), but actually has (8,).
ValueError: bc return is expected to have shape (11,), but actually has (8,).
I have fitted a logisitic regression model to some data, everything is working great. I need to calculate the wald statistic which is a function of the model result.
My problem is that I do not understand, from the documentation, what the wald test requires as input? Specifically what is the R matrix and how is it generated?
I tried simply inputting the data I used to train and test the model as the R matrix, but I do not think this is correct. The documentation suggest examining the examples however none give an example of this test. I also asked the same question on crossvalidated but got shot down.
Kind regards
http://statsmodels.sourceforge.net/0.6.0/generated/statsmodels.discrete.discrete_model.LogitResults.wald_test.html#statsmodels.discrete.discrete_model.LogitResults.wald_test
The Wald test is used to test if a predictor is significant or not, of the form:
W = (beta_hat - beta_0) / SE(beta_hat) ~ N(0,1)
So somehow you'll want to input the predictors into the test. Judging from the example of the t.test and f.test, it may be simpler to input a string or tuple to indicate what you are testing.
Here is their example using a string for the f.test:
from statsmodels.datasets import longley
from statsmodels.formula.api import ols
dta = longley.load_pandas().data
formula = 'TOTEMP ~ GNPDEFL + GNP + UNEMP + ARMED + POP + YEAR'
results = ols(formula, dta).fit()
hypotheses = '(GNPDEFL = GNP), (UNEMP = 2), (YEAR/1829 = 1)'
f_test = results.f_test(hypotheses)
print(f_test)
And here is their example using a tuple:
import numpy as np
import statsmodels.api as sm
data = sm.datasets.longley.load()
data.exog = sm.add_constant(data.exog)
results = sm.OLS(data.endog, data.exog).fit()
r = np.zeros_like(results.params)
r[5:] = [1,-1]
T_test = results.t_test(r)
If you're still struggling getting the wald test to work, include your code and I can try to help make it work.
I'm trying to generate random variables according to a certain ugly distribution, in Python. I have an explicit expression for the PMF, but it involves some products which makes it unpleasant to obtain and invert the CDF (see below code for explicit form of PMF).
In essence, I'm trying to define a random variable in Python by its PMF and then have built-in code do the hard work of sampling from the distribution. I know how to do this if the support of the RV is finite, but here the support is countably infinite.
The code I am currently trying to run as per #askewchan's advice below is:
import scipy as sp
import numpy as np
class x_gen(sp.stats.rv_discrete):
def _pmf(self,k,param):
num = np.arange(1+param, k+param, 1)
denom = np.arange(3+2*param, k+3+2*param, 1)
p = (2+param)*(np.prod(num)/np.prod(denom))
return p
pa_limit = limitrv_gen()
print pa_limit.rvs(alpha,n=1)
However, this returns the error while running:
File "limiting_sim.py", line 42, in _pmf
num = np.arange(1+param, k+param, 1)
TypeError: only length-1 arrays can be converted to Python scalars
Basically, it seems that the np.arange() list isn't working somehow inside the def _pmf() function. I'm at a loss to see why. Can anyone enlighten me here and/or point out a fix?
EDIT 1: cleared up some questions by askewchan, edits reflected above.
EDIT 2: askewchan suggested an interesting approximation using the factorial function, but I'm looking more for an exact solution such as the one that I'm trying to get work with np.arange.
You should be able to subclass rv_discrete like so:
class mydist_gen(rv_discrete):
def _pmf(self, n, param):
return yourpmf(n, param)
Then you can create a distribution instance with:
mydist = mydist_gen()
And generate samples with:
mydist.rvs(param, size=1000)
Or you can then create a frozen distribution object with:
mydistp = mydist(param)
And finally generate samples with:
mydistp.rvs(1000)
With your example, this should work, since factorial automatically broadcasts. But, it might fail for large enough alpha:
import scipy as sp
import numpy as np
from scipy.misc import factorial
class limitrv_gen(sp.stats.rv_discrete):
def _pmf(self, k, alpha):
#num = np.prod(np.arange(1+alpha, k+alpha))
num = factorial(k+alpha-1) / factorial(alpha)
#denom = np.prod(np.arange(3+2*alpha, k+3+2*alpha))
denom = factorial(k + 2 + 2*alpha) / factorial(2 + 2*alpha)
return (2+alpha) * num / denom
pa_limit = limitrv_gen()
alpha = 100
pa_limit.rvs(alpha, size=10)