Regression coefficients do not match conditional means - python

You can download the following data set from this repo.
          Y  CONST  T  X1  X1T  X2  X2T
0   2.31252      1  1   0    0   1    1
1  -0.836074     1  1   1    1   1    1
2  -0.797183     1  0   0    0   1    0
I have a dependent variable (Y) and three binary columns (T, X1 and X2). From this data we can create four groups:
X1 == 0 and X2 == 0
X1 == 0 and X2 == 1
X1 == 1 and X2 == 0
X1 == 1 and X2 == 1
Within each group, I want to calculate the difference in the mean of Y between observations with T == 1 and T == 0.
I can do so with the following code:
# Libraries
import pandas as pd
# Group by T, X1, X2 and get the mean of Y
t = df.groupby(['T','X1','X2'])['Y'].mean().reset_index()
# Reshape the result and rename the columns
t = t.pivot(index=['X1','X2'], columns='T', values='Y')
t.columns = ['Teq0','Teq1']
# Differences in the mean of Y between T == 1 and T == 0 within each group
t['Teq1'] - t['Teq0']
> X1 X2
> 0 0 0.116175
> 1 0.168791
> 1 0 -0.027278
> 1 -0.147601
Problem
I want to recreate these results with the following regression model (m).
# Libraries
from statsmodels.api import OLS
# Fit regression with interaction terms
m = OLS(endog=df['Y'], exog=df[['CONST','T','X1','X1T','X2','X2T']]).fit()
# Estimated values
m.params[['T','X1T','X2T']]
> T 0.162198
> X1T -0.230372
> X2T -0.034303
I was expecting the coefficients:
T = 0.116175
T + X1T = 0.168791
T + X2T = -0.027278
T + X1T + X2T = -0.147601
Question
Why don't the regression coefficients match the results from the first chunk's output (t['Teq1'] - t['Teq0'])?

Thanks to @Josef for noticing that T, X1 and X2 have eight different combinations while my regression model has six parameters. I was therefore missing two interaction terms (and thus two parameters).
Namely, the regression model needs to account for the interaction between X1 and X2 as well as the interaction between X1, X2 and T.
This can be done by declaring the missing interaction columns and fitting the model:
# Declare missing interaction columns
df = df.assign(X1X2=df['X1'].multiply(df['X2']),
               X1X2T=df['X1'].multiply(df['X2T']))
# List of independent variables
cols = ['CONST','T','X1','X1T','X2','X2T','X1X2','X1X2T']
# Fit model
m = OLS(endog=df['Y'], exog=df[cols]).fit()
Alternatively, we can use the formula interface:
# Declare formula
f = 'Y ~ T + X1 + I(X1*T) + X2 + I(X2*T) + I(X1*X2) + I(X1*X2*T)'
# Fit model
m = OLS.from_formula(formula=f, data=df).fit()
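As a cross-check, the saturated model can also be written with the `*` formula shorthand, which expands to all main effects and interactions. Below is a minimal sketch with synthetic data (the original CSV is not reproduced here, so the data frame is a stand-in, and the exact numbers are illustrative only):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the original data set (illustrative only)
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({'T': rng.integers(0, 2, n),
                   'X1': rng.integers(0, 2, n),
                   'X2': rng.integers(0, 2, n)})
df['Y'] = rng.normal(size=n) + 0.5 * df['T'] * df['X1']

# Saturated model: T * X1 * X2 expands to all main effects and interactions
m = ols('Y ~ T * X1 * X2', data=df).fit()

# Group-wise differences in mean(Y) between T == 1 and T == 0
t = df.groupby(['T', 'X1', 'X2'])['Y'].mean().reset_index()
t = t.pivot(index=['X1', 'X2'], columns='T', values='Y')
diff = t[1] - t[0]

# With the saturated model, the coefficients reproduce the
# group-wise differences exactly, e.g. for X1 == 0 and X2 == 0:
print(np.isclose(m.params['T'], diff.loc[(0, 0)]))  # → True
```

The sums T + T:X1, T + T:X2, and T + T:X1 + T:X2 + T:X1:X2 recover the other three group differences in the same way.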

Related

Python Z3, rule to make 2 numbers be 2 certain numbers in a 2D array

If I have two z3 Ints, for example x1 and x2, and a 2D array of numbers, for example:
list = [[1,2],[12,13],[45,7]]
I need to write a rule so that x1 and x2 are any of the pairs of numbers in the list; for example, x1 would be 1 and x2 would be 2, or x1 is 12 and x2 is 13.
I'm guessing it would be something like:
solver = Solver()
for i in range(0, len(list)):
    solver.add(And((x1 == list[i][0]), (x2 == list[i][1])))
but this would obviously always be unsat, so I need to write it so that x1 and x2 can be any of the pairs in the list. It's worth noting that the number of pairs in the list could be anything, not just 3 pairs.
You're on the right track. Simply iterate and form the disjunction instead. Something like:
from z3 import *
list = [[1,2],[12,13],[45,7]]
s = Solver()
x1, x2 = Ints('x1 x2')
s.add(Or([And(x1 == p[0], x2 == p[1]) for p in list]))
while s.check() == sat:
    m = s.model()
    print("x1 = %2d, x2 = %2d" % (m[x1].as_long(), m[x2].as_long()))
    s.add(Or(x1 != m[x1], x2 != m[x2]))
When run, this prints:
x1 = 1, x2 = 2
x1 = 12, x2 = 13
x1 = 45, x2 = 7

XOR linear equation system solver in Python

I have a matrix with n rows and n+1 columns and need to construct the corresponding system.
For example, the matrix is
x4 x3 x2 x1 result
1 1 0 1 0
1 0 1 0 1
0 1 0 1 1
1 0 1 1 0
Then the equations will be (+ is XOR):
x4+x3+x1=0
x4+x2=1
x3+x1=1
x4+x2+x1=0
I need to return the answer as a list of x1, ...
How can we do it in Python?
You could make use of z3py, the Python interface of Microsoft's Z3 solver:
from z3 import *
def xor2(a, b):
    return Xor(a, b)
def xor3(a, b, c):
    return Xor(a, Xor(b, c))
# define Boolean variables
x1 = Bool('x1')
x2 = Bool('x2')
x3 = Bool('x3')
x4 = Bool('x4')
s = Solver()
# every equation is expressed as one constraint
s.add(Not(xor3(x4, x3, x1)))
s.add(xor2(x4, x2))
s.add(xor2(x3, x1))
s.add(Not(xor3(x4, x2, x1)))
# solve and output results
print(s.check())
print(s.model())
Result:
sat
[x3 = False, x2 = False, x1 = True, x4 = True]
Alternatively, learn Gaussian elimination, which also works for XOR systems (they are linear systems over GF(2)), and write a Gaussian-elimination program in Python.
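The Gaussian-elimination route can be sketched without any solver dependency. `solve_xor` below is a hypothetical helper, not from the original answer; it assumes the system is consistent and sets free variables to 0:

```python
def solve_xor(rows):
    """Solve a XOR linear system given as rows of 0/1 bits,
    each row being [a_1, ..., a_n, result]."""
    rows = [r[:] for r in rows]       # work on a copy
    n = len(rows[0]) - 1              # number of unknowns
    pivot_of = {}                     # column -> row index holding its pivot
    r = 0
    for col in range(n):
        # find a row (at or below r) with a 1 in this column
        for i in range(r, len(rows)):
            if rows[i][col]:
                rows[r], rows[i] = rows[i], rows[r]
                break
        else:
            continue                  # no pivot: free variable
        # eliminate the column from every other row (XOR = addition mod 2)
        for i in range(len(rows)):
            if i != r and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        pivot_of[col] = r
        r += 1
    # read off the solution (free variables default to 0)
    return [rows[pivot_of[c]][-1] if c in pivot_of else 0 for c in range(n)]

# The example system, columns ordered x4 x3 x2 x1:
system = [[1, 1, 0, 1, 0],
          [1, 0, 1, 0, 1],
          [0, 1, 0, 1, 1],
          [1, 0, 1, 1, 0]]
print(solve_xor(system))  # → [1, 0, 0, 1], i.e. x4=1, x3=0, x2=0, x1=1
```

This matches the Z3 answer above (x4 = True, x3 = False, x2 = False, x1 = True).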

Can we use a pandas data frame to calculate the next value using a previous value? A good example would be the Fibonacci numbers

So I understand we can use pandas data frame to do vector operations on cells like
df = pd.DataFrame([a, b, c])
df*3
would equal something like :
0 a*3
1 b*3
2 c*3
but could we use a pandas DataFrame to, say, calculate the Fibonacci sequence?
I am asking this because for the Fibonacci sequence the next number depends on the previous two numbers (F_n = F_(n-1) + F_(n-2)). I am not exactly interested in the Fibonacci sequence, and more interested in knowing whether we can do something like:
df = pd.DataFrame([a,b,c])
df.apply( some_func )
0 x1 a
1 x2 b
2 x3 c
where x1 would be calculated from a,b,c (I know this is possible), x2 would be calculated from x1 and x3 would be calculated from x2
the Fibonacci example would just be something like :
df = pd.DataFrame()
df.apply(fib(n, df))
0 0
1 1
2 1
3 2
4 3
5 5
.
.
.
n-1 F(n-1) + F(n-2)
You need to iterate through the rows and access previous rows' data with DataFrame.loc. For example, with n = 6:
df = pd.DataFrame()
for i in range(0, 6):
    df.loc[i, 'f'] = i if i in [0, 1] else df.loc[i - 1, 'f'] + df.loc[i - 2, 'f']
df
f
0 0.0
1 1.0
2 1.0
3 2.0
4 3.0
5 5.0
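Since growing a DataFrame with `.loc` inside a Python loop is slow for large n, an alternative sketch (not from the original answer; `fib_frame` is a hypothetical helper name) builds the sequence in a plain list and constructs the DataFrame once at the end:

```python
import pandas as pd

def fib_frame(n):
    # Build the Fibonacci sequence in a plain Python list first,
    # then hand it to pandas in one shot.
    fib = [0, 1]
    for _ in range(2, n):
        fib.append(fib[-1] + fib[-2])
    return pd.DataFrame({'f': fib[:n]})

print(fib_frame(6)['f'].tolist())  # → [0, 1, 1, 2, 3, 5]
```

Appending to a list is O(1) per element, whereas each `df.loc[i, ...]` assignment touches the whole frame.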

Mixed integer program python

I have an optimization problem where I want to maximize the sum of column Z by picking one row per unique value of column X, subject to the constraint that the sum of column Y over the picked rows must be less than or equal to (in this example) 23.
For example, I have this sample data:
X Y Z
1 9 25
1 7 20
1 5 5
2 9 20
2 7 10
2 5 5
3 9 10
3 7 5
3 5 5
The result should look like this:
X Y Z
1 9 25
2 9 20
3 5 5
This is a replica of "Set up linear programming optimization in R using LpSolve?", which has a solution, but I need the same in Python.
Those who want help getting started with PuLP in Python can refer to http://ojs.pythonpapers.org/index.php/tppm/article/view/111
The GitHub repo https://github.com/coin-or/pulp/tree/master/doc/KPyCon2009 could be handy as well.
Below is the Python code for the dummy problem asked:
import pandas as pd
import pulp
X = [1, 1, 1, 2, 2, 2, 3, 3, 3]
Y = [9, 7, 5, 9, 7, 5, 9, 7, 5]
Z = [25, 20, 5, 20, 10, 5, 10, 5, 5]
df = pd.DataFrame({'X': X, 'Y': Y, 'Z': Z})
allx = df['X'].unique()
possible_values = [(w, b) for w in allx for b in range(1, 4)]
# Binary decision variable for each (X value, row-within-group) pair
x = pulp.LpVariable.dicts('arr', (allx, range(1, 4)),
                          lowBound=0,
                          upBound=1,
                          cat=pulp.LpInteger)
model = pulp.LpProblem("Optim", pulp.LpMaximize)
# Objective: maximize the total Z of the chosen rows
model += sum([x[w][b] * df[df['X'] == w].reset_index()['Z'][b - 1] for (w, b) in possible_values])
# Constraint: the total Y of the chosen rows must not exceed 23
model += sum([x[w][b] * df[df['X'] == w].reset_index()['Y'][b - 1] for (w, b) in possible_values]) <= 23, \
    "Maximum_number_of_Y"
# Pick exactly one row per unique value of X
for value in allx:
    model += sum([x[w][b] for (w, b) in possible_values if w == value]) >= 1
for value in allx:
    model += sum([x[w][b] for (w, b) in possible_values if w == value]) <= 1
# View definition
model
model.solve()
print("The chosen rows are out of a total of %s:" % len(possible_values))
for v in model.variables():
    print(v.name, "=", v.varValue)
For the solution in R:
d = data.frame(x=c(1,1,1,2,2,2,3,3,3), y=c(9,7,5,9,7,5,9,7,5), z=c(25,20,5,20,10,5,10,5,3))
library(lpSolve)
all.x <- unique(d$x)
d[lp(direction = "max",
     objective.in = d$z,
     const.mat = rbind(outer(all.x, d$x, "=="), d$y),
     const.dir = rep(c("==", "<="), c(length(all.x), 1)),
     const.rhs = rep(c(1, 23), c(length(all.x), 1)),
     all.bin = TRUE)$solution == 1,]

Regression using ols on a non varying independent variable

This is my dataset:
x y z
1 2 1
1 4 6
1 1 12
1 5 14
1 6 17
1 9 18
Now I want to run a regression on this using the ols function of the statsmodels library in Python. For this I used:
lm = smf.ols(formula='z ~ x + I(x+y)', data=data).fit()
Now I get the coefficients of x, I(x+y), and the intercept. Since the independent variable x is constant throughout the data set, its coefficient should be 0, as the dependent variable z does not depend on the value of x. But my output is different from what I expected. My output is:
Intercept 1.293173
col1 1.293173
I(col1 + col2) 1.590361
I used the same data to find the coefficients on R using the following function:
m <- lm(z ~ x + I(x+y), data = new.data)
and for this my output is:
Coefficients:
(Intercept) x I(x + y)
2.586 NA 1.590
Why am I getting this result from the ols model in Python? How can I overcome this problem?
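For context on why the two libraries disagree: because x is constant, the x column duplicates the intercept column, so the design matrix is rank-deficient. R's lm drops the redundant column and reports NA, while statsmodels' pinv-based OLS spreads the shared coefficient across the duplicated columns (note 1.293173 + 1.293173 ≈ 2.586, the R intercept). A minimal numpy sketch of the rank deficiency:

```python
import numpy as np

# The data set from the question
x = np.ones(6)
y = np.array([2, 4, 1, 5, 6, 9], dtype=float)

# Design matrix for z ~ x + I(x + y): intercept, x, x + y
X = np.column_stack([np.ones(6), x, x + y])

# Three columns, but the intercept and x columns are identical,
# so the matrix has rank 2, not 3.
print(np.linalg.matrix_rank(X))  # → 2
```

Because the rank is below the number of parameters, the individual coefficients of the intercept and x are not identified; dropping x from the formula removes the ambiguity.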
