This is my dataset:
x y z
1 2 1
1 4 6
1 1 12
1 5 14
1 6 17
1 9 18
Now I want to run a regression on this using the ols function of the statsmodels library in Python. For this I used:
lm = smf.ols(formula='z ~ x + I(x+y)', data=data).fit()
Now I should get the coefficients of x, (x + y) and the intercept. Since the independent variable x is constant throughout the data set, its coefficient should be 0, as the dependent variable z does not depend on the value of x. But my output is different from what I expected. My output is:
Intercept 1.293173
x 1.293173
I(x + y) 1.590361
I used the same data to find the coefficients on R using the following function:
m <- lm(z ~ x + I(x+y), data = new.data)
and for this my output is:
Coefficients:
(Intercept) x I(x + y)
2.586 NA 1.590
Why am I getting this result from the ols model in Python, and how can I overcome this problem?
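For context: with x constant, the x column in the design matrix duplicates the intercept column, so the matrix is rank deficient. R's lm drops the aliased column and reports NA; statsmodels' default pinv-based solver instead returns a minimum-norm solution that splits the shared coefficient between the two collinear columns (note that 1.293173 + 1.293173 ≈ 2.586, R's intercept). A minimal sketch of how to see the rank deficiency, assuming numpy and patsy (a statsmodels dependency) are available:
import numpy as np
import patsy

# Build the design matrix the formula produces and check its rank:
# with x constant, the x column is identical to the intercept column.
X = patsy.dmatrix('x + I(x+y)', data)
print(np.linalg.matrix_rank(np.asarray(X)), 'of', X.shape[1], 'columns')  # 2 of 3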
You can download the following data set from this repo.
     Y          CONST  T  X1  X1T  X2  X2T
0     2.31252   1      1  0   0    1   1
1    -0.836074  1      1  1   1    1   1
2    -0.797183  1      0  0   0    1   0
I have a dependent variable (Y) and three binary columns (T, X1 and X2). From this data we can create four groups:
X1 == 0 and X2 == 0
X1 == 0 and X2 == 1
X1 == 1 and X2 == 0
X1 == 1 and X2 == 1
Within each group, I want to calculate the difference in the mean of Y between observations with T == 1 and T == 0.
I can do so with the following code:
# Libraries
import pandas as pd
# Group by T, X1, X2 and get the mean of Y
t = df.groupby(['T','X1','X2'])['Y'].mean().reset_index()
# Reshape the result and rename the columns
t = t.pivot(index=['X1','X2'], columns='T', values='Y')
t.columns = ['Teq0','Teq1']
# I want to replicate these differences with a regression
t['Teq1'] - t['Teq0']
> X1 X2
> 0 0 0.116175
> 1 0.168791
> 1 0 -0.027278
> 1 -0.147601
Problem
I want to recreate these results with the following regression model (m).
# Libraries
from statsmodels.api import OLS
# Fit regression with interaction terms
m = OLS(endog=df['Y'], exog=df[['CONST','T','X1','X1T','X2','X2T']]).fit()
# Estimated values
m.params[['T','X1T','X2T']]
> T 0.162198
> X1T -0.230372
> X2T -0.034303
I was expecting the coefficients:
T = 0.116175
T + X2T = 0.168791
T + X1T = -0.027278
T + X1T + X2T = -0.147601
Question
Why don't the regression coefficients match the results from the first chunk's output (t['Teq1'] - t['Teq0'])?
Thanks to @Josef for noticing that T, X1 and X2 have eight different combinations, while my regression model has only six parameters. I was therefore missing two interaction terms (and thus two parameters).
Namely, the regression model needs to account for the interaction between X1 and X2 as well as the interaction between X1, X2 and T.
This can be done by declaring the missing interaction columns and fitting the model:
# Declare missing columns
df = df.assign(X1X2=df['X1'].multiply(df['X2']),
               X1X2T=df['X1'].multiply(df['X2T']))
# List of independent variables
cols = ['CONST','T','X1','X1T','X2','X2T','X1X2','X1X2T']
# Fit model
m = OLS(endog=df['Y'], exog=df[cols]).fit()
Alternatively, we can use the formula interface:
# Declare formula
f = 'Y ~ T + X1 + I(X1*T) + X2 + I(X2*T) + I(X1*X2) + I(X1*X2*T)'
# Fit model
m = OLS.from_formula(formula=f, data=df).fit()
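As a sanity check (a sketch, using the column-based fit with exog=df[cols] from above, since the formula fit names its terms differently), the four group differences can be rebuilt from the estimated parameters and should now match t['Teq1'] - t['Teq0']:
# Rebuild each group's (T==1 minus T==0) difference from the coefficients:
# for group (X1=x1, X2=x2) it equals T + x1*X1T + x2*X2T + x1*x2*X1X2T
p = m.params
print(p['T'])                                      # X1 == 0 and X2 == 0
print(p['T'] + p['X2T'])                           # X1 == 0 and X2 == 1
print(p['T'] + p['X1T'])                           # X1 == 1 and X2 == 0
print(p['T'] + p['X1T'] + p['X2T'] + p['X1X2T'])   # X1 == 1 and X2 == 1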
I have a problem calculating variance with "hidden" NULL (zero) values. Usually that shouldn't be a problem, because a NULL value is not a value, but in my case it is essential to include those NULLs as zeros in the variance calculation. So I have a DataFrame that looks like this:
TableA:
A X Y
1 1 30
1 2 20
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
Then I need to get the variance for each distinct X value, which I do like this:
TableA.groupby(['X']).agg({'Y':'var'})
But the answer is not what I need, since the variance calculation should also include a zero Y value for X=3 when A=1 and A=3.
What my dataset should look like to get the needed variance results:
A X Y
1 1 30
1 2 20
1 3 0
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
3 3 0
So the variance needs to take into account that every A should have X values 1, 2 and 3, and when there is no Y value for a certain X, it should count as 0. Could you help me with this? How should I change my TableA DataFrame to achieve this, or is there another way?
Desired output for TableA should be like this:
X Y
1 75.000000
2 75.000000
3 133.333333
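To make the target concrete, here is the X = 3 row worked out by hand: the filled values are 0, 20 and 0, so the mean is 20/3 and the sample variance is ((0 - 20/3)² + (20 - 20/3)² + (0 - 20/3)²) / 2 = (400/9 + 1600/9 + 400/9) / 2 = 133.333…, matching the desired output above.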
Compute the variance directly, but divide by the number of different possibilities for A:
import numpy as np

# three in your example; adjust as needed
a_choices = len(TableA['A'].unique())

def variance_with_missing(vals):
    # the mean over all a_choices slots, counting the missing values as 0
    mean_with_missing = np.sum(vals) / a_choices
    # sum of squared deviations for the values that are present
    ss_present = np.sum((vals - mean_with_missing)**2)
    # sum of squared deviations for the missing (zero) values
    ss_missing = (a_choices - len(vals)) * mean_with_missing**2
    return (ss_present + ss_missing) / (a_choices - 1)

TableA.groupby(['X']).agg({'Y': variance_with_missing})
The approach of the solution below is to append the missing (A, X) combinations with Y = 0. It's a little messy, but I hope it helps.
import numpy as np
import pandas as pd

TableA = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3, 3],
                       'X': [1, 2, 1, 2, 3, 1, 2],
                       'Y': [30, 20, 15, 20, 20, 30, 35]})
TableA['A'] = TableA['A'].astype(int)

#### Create rows for the non-existing combinations and fill them with 0 ####
for i in range(1, TableA.X.max() + 1):
    for j in TableA.A.unique():
        if TableA[(TableA.X == i) & (TableA.A == j)].empty:
            # DataFrame.append was removed in pandas 2.0, so use pd.concat
            TableA = pd.concat([TableA, pd.DataFrame({'A': [j], 'X': [i], 'Y': [0]})],
                               ignore_index=True)

TableA.groupby('X').agg({'Y': 'var'})
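A possibly cleaner alternative (a sketch, assuming every unique A should have a row for every X from 1 to TableA.X.max()) is to reindex against the full (A, X) grid instead of appending row by row:
import pandas as pd

# Build the complete (A, X) grid and fill the missing Y values with 0
full_index = pd.MultiIndex.from_product(
    [sorted(TableA['A'].unique()), range(1, TableA['X'].max() + 1)],
    names=['A', 'X'])
filled = (TableA.set_index(['A', 'X'])
                .reindex(full_index, fill_value=0)
                .reset_index())
# The variance per X now counts the filled zeros
print(filled.groupby('X')['Y'].var())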
I have the following data:
X1 X2 Y
-10 4 0
-10 3 4
-10 2.5 8
-8 3 7
-8 4 8
-8 4.4 9
0 2 9
0 2.3 9.2
0 4 10
0 5 12
I need to create a simple regression model to predict Y given X1 and X2: Y = f(X1,X2).
This is my code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X1 = poly.fit_transform(df["X1"].values.reshape(-1,1))
X2 = poly.fit_transform(df["X2"].values.reshape(-1,1))
clf = linear_model.LinearRegression()
clf.fit([X1,X2], df["Y"].values.reshape(-1, 1))
print(clf.coef_)
print(clf.intercept_)
Y_test = clf.predict([X1, X2])
df_test=pd.DataFrame()
df_test["X1"] = df["X1"]
df_test["Y"] = df["Y"]
df_test["Y_PRED"] = Y_test
df_test.plot(x="X1",y=["Y","Y_PRED"], figsize=(10,5), grid=True)
plt.show()
But it fails at line clf.fit([X1,X2], df["Y"].values.reshape(-1, 1)):
ValueError: Found array with dim 3. Estimator expected <= 2
It looks like the model cannot work with 2 input parameters X1 and X2. How should I change the code to fix it?
Well, your mistake resides in the way you combine your feature arrays: [X1, X2] stacks the two (n, 3) arrays into a 3-dimensional array, which is what the error message complains about. You should instead concatenate them column-wise, for instance using pandas:
import pandas as pd
X12_p = pd.concat([pd.DataFrame(X1), pd.DataFrame(X2)], axis=1)
Or the same using numpy:
import numpy as np
X12_p = np.concatenate([X1, X2], axis=1)
Your final snippet should look like:
# Fit
Y = df["Y"].values.reshape(-1,1)
X12_p = pd.concat([pd.DataFrame(X1), pd.DataFrame(X2)], axis=1)
clf.fit(X12_p, Y)
# Predict
Y_test = clf.predict(X12_p)
You can as well evaluate some performance metrics, such as the RMSE:
import numpy as np
from sklearn.metrics import mean_squared_error

# mean_squared_error returns the MSE; take the square root for the RMSE
print('rmse = {0:.5f}'.format(np.sqrt(mean_squared_error(Y, Y_test))))
Please also note that you can exclude the bias term from the polynomial features by changing the default parameter (otherwise each of the two transformed blocks carries its own constant column):
PolynomialFeatures(degree=2, include_bias=False)
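As a side note, a sketch of an alternative approach: passing both columns to a single PolynomialFeatures call expands them jointly, which also yields the X1*X2 cross term that two separate transforms cannot produce:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Expand X1 and X2 together: gives X1, X2, X1^2, X1*X2, X2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X12_p = poly.fit_transform(df[["X1", "X2"]])
clf = LinearRegression().fit(X12_p, df["Y"].values.reshape(-1, 1))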
Hope this helps.
I have the following data:
df = pd.DataFrame({'sound': ['A', 'B', 'B', 'A', 'B', 'A'],
'score': [10, 5, 6, 7, 11, 1]})
print(df)
sound score
0 A 10
1 B 5
2 B 6
3 A 7
4 B 11
5 A 1
If I standardize (i.e. Z score) the score variable, I get the following values. The mean of the new z column is basically 0, with SD of 1, both of which are expected for a standardized variable:
df['z'] = (df['score'] - df['score'].mean())/df['score'].std()
print(df)
print('Mean: {}'.format(df['z'].mean()))
print('SD: {}'.format(df['z'].std()))
sound score z
0 A 10 0.922139
1 B 5 -0.461069
2 B 6 -0.184428
3 A 7 0.092214
4 B 11 1.198781
5 A 1 -1.567636
Mean: -7.401486830834377e-17
SD: 1.0
However, what I'm actually interested in is calculating Z scores based on group membership (sound). For example, if a score is from sound A, then convert that value to a Z score using the mean and SD of the sound A values only. Likewise, sound B Z scores will only use the mean and SD from sound B. This will obviously produce different values compared to the regular Z score calculation:
df['zg'] = df.groupby('sound')['score'].transform(lambda x: (x - x.mean()) / x.std())
print(df)
print('Mean: {}'.format(df['zg'].mean()))
print('SD: {}'.format(df['zg'].std()))
sound score z zg
0 A 10 0.922139 0.872872
1 B 5 -0.461069 -0.725866
2 B 6 -0.184428 -0.414781
3 A 7 0.092214 0.218218
4 B 11 1.198781 1.140647
5 A 1 -1.567636 -1.091089
Mean: 3.700743415417188e-17
SD: 0.894427190999916
My question is: why is the mean of the group-based standardized values (zg) also basically equal to 0? Is this expected behaviour or is there an error in my calculation somewhere?
The z scores make sense because standardizing within a variable essentially forces the mean to 0. But the zg values are calculated using different means and SDs for each sound group, so I'm not sure why the mean of that new variable has also been set to 0.
The only situation where I can see this happening is if the sum of values > 0 is equal to sum of values < 0, which when averaged would cancel out to 0. This happens in a regular Z score calculation but I'm surprised that this also happens when operating across multiple groups like this...
I think it makes perfect sense. If E[abc | def] denotes the expectation of abc given def, then in df['zg']:
m1 = E['zg' | sound = 'A'] = (0.872872 + 0.218218 - 1.091089)/3 ≈ 0
m2 = E['zg' | sound = 'B'] = (-0.725866 - 0.414781 + 1.140647)/3 ≈ 0
and, since each group contains three of the six observations,
E['zg'] = (m1 + m2)/2 = (0.872872 + 0.218218 - 1.091089 - 0.725866 - 0.414781 + 1.140647)/6 ≈ 0
Yes, this is expected behavior.
In fancy words, using the Law of Iterated Expectations,
E[X] = E[ E[X | Y] ]
and specifically, if the groups Y are finite and thus countable,
E[X] = Σ_j E[X | Y = y_j] · P(Y = y_j)
where the weights P(Y = y_j) sum to 1 over the set G of possible groups.
However, by construction, every E[X | Y = y_j] is 0 for all values y_j in your set G of possible groups.
Thus, the total average will also be zero.
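A quick numerical check of this, using the df from the question:
# Each within-group mean of zg is 0 by construction
print(df.groupby('sound')['zg'].mean())
# and the overall mean is the size-weighted average of those group means,
# so it is also (numerically) zero
print(df['zg'].mean())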
I have an optimization problem where I am trying to maximize the sum of column Z by picking one row for each unique value of column X, subject to the constraint that the sum of column Y over the picked rows must be less than or equal to (in this example) 23.
For example, I have this sample data:
X Y Z
1 9 25
1 7 20
1 5 5
2 9 20
2 7 10
2 5 5
3 9 10
3 7 5
3 5 5
The result should look like this:
X Y Z
1 9 25
2 9 20
3 5 5
This is a replica of "Set up linear programming optimization in R using LpSolve?", which has a solution in R, but I need the same in Python.
Those who want help getting started with pulp in Python can refer to http://ojs.pythonpapers.org/index.php/tppm/article/view/111
The GitHub repo https://github.com/coin-or/pulp/tree/master/doc/KPyCon2009 can be handy as well.
Below is the Python code for the example problem asked:
import pandas as pd
import pulp

X = [1, 1, 1, 2, 2, 2, 3, 3, 3]
Y = [9, 7, 5, 9, 7, 5, 9, 7, 5]
Z = [25, 20, 5, 20, 10, 5, 10, 5, 5]
df = pd.DataFrame({'X': X, 'Y': Y, 'Z': Z})

allx = df['X'].unique()
# one binary variable per (X value, position-within-group) pair
possible_values = [(w, b) for w in allx for b in range(1, 4)]
x = pulp.LpVariable.dicts('arr', (allx, range(1, 4)),
                          lowBound=0,
                          upBound=1,
                          cat=pulp.LpInteger)

model = pulp.LpProblem("Optim", pulp.LpMaximize)

# objective: maximize the total Z over the chosen rows
model += pulp.lpSum([x[w][b] * df[df['X'] == w].reset_index()['Z'][b - 1]
                     for (w, b) in possible_values])

# constraint: the total Y over the chosen rows must not exceed 23
model += pulp.lpSum([x[w][b] * df[df['X'] == w].reset_index()['Y'][b - 1]
                     for (w, b) in possible_values]) <= 23, \
    "Maximum_number_of_Y"

# pick exactly one row per unique X value
for value in allx:
    model += pulp.lpSum([x[w][b] for (w, b) in possible_values if w == value]) == 1

## View the definition
model

model.solve()
print("The chosen rows, out of a total of %s:" % len(possible_values))
for v in model.variables():
    print(v.name, "=", v.varValue)
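To pull the selected rows back out as a dataframe, something like this works (a sketch, assuming the model solved to optimality):
# Keep the (X value, position) pairs whose binary variable ended up at 1
chosen = [(w, b) for (w, b) in possible_values if x[w][b].value() == 1]
rows = pd.concat([df[df['X'] == w].reset_index(drop=True).iloc[[b - 1]]
                  for (w, b) in chosen])
print(rows)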
For the solution in R:
d=data.frame(x=c(1,1,1,2,2,2,3,3,3),y=c(9,7,5,9,7,5,9,7,5),z=c(25,20,5,20,10,5,10,5,3))
library(lpSolve)
all.x <- unique(d$x)
d[lp(direction = "max",
objective.in = d$z,
const.mat = rbind(outer(all.x, d$x, "=="), d$y),
const.dir = rep(c("==", "<="), c(length(all.x), 1)),
const.rhs = rep(c(1, 23), c(length(all.x), 1)),
all.bin = TRUE)$solution == 1,]