In R, you can get confidence intervals for each coefficient in a logistic regression, as shown here (https://www.r-bloggers.com/example-9-14-confidence-intervals-for-logistic-regression-models/).
Can you do this with scikit-learn in Python? I was exploring, but I couldn't find a way.
I don't think you can get that from scikit-learn. One option is to use statsmodels in Python, whose output is very similar to R's:
import numpy as np  # needed for np.array below
import statsmodels.api as sm
import pandas as pd
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
header=None,names=["s_wid","s_len","p_wid","p_len","species"])
y = np.array(df['species'] == "Iris-virginica").astype(int)
X = sm.add_constant(df.iloc[:,:4])
model = sm.Logit(y, X)
result = model.fit()
result.summary()
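                           Logit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  150
Model:                          Logit   Df Residuals:                      145
Method:                           MLE   Df Model:                            4
Date:                Wed, 17 Jun 2020   Pseudo R-squ.:                  0.9377
Time:                        00:25:21   Log-Likelihood:                -5.9493
converged:                       True   LL-Null:                       -95.477
Covariance Type:            nonrobust   LLR p-value:                 1.189e-37
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -42.6378     25.708     -1.659      0.097     -93.024       7.748
s_wid         -2.4652      2.394     -1.030      0.303      -7.158       2.228
s_len         -6.6809      4.480     -1.491      0.136     -15.461       2.099
p_wid          9.4294      4.737      1.990      0.047       0.145      18.714
p_len         18.2861      9.743      1.877      0.061      -0.809      37.381
==============================================================================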
I have two scatterplots that I've placed on one plot. I want to find the linear regression line for the points of y1 and y2 combined (that is, the regression of the pooled y1 and y2 values on x), but I'm having difficulty, since I usually only fit the regression line for y1 or y2 separately. I also want to find the r^2 value for the combined y1 and y2. I would appreciate any help I can get!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame(np.random.randint(0,100,size=(15, 2)), columns=list('AB'))
y1 = df1['A']
y2 = df1['B']
plt.scatter(df1.index, y1)
plt.scatter(df1.index, y2)
plt.show()
Sounds like you want to 'stack' columns A and B together. There are many ways to do it; here is one using stack:
df2 = df1.stack().rename('A_and_B').reset_index(level = 1, drop = True).to_frame()
Then df2.head() looks like this:
A_and_B
0 35
0 58
1 49
1 73
2 44
and the scatter plot:
plt.scatter(df2.index, df2['A_and_B'])
I don't know how you run your regressions, but you can now apply your method to df2. For example:
import statsmodels.api as sm
res = sm.OLS(df2['A_and_B'], df2.index).fit()
res.summary()
output:
OLS Regression Results
=======================================================================================
Dep. Variable: A_and_B R-squared (uncentered): 0.517
Model: OLS Adj. R-squared (uncentered): 0.501
Method: Least Squares F-statistic: 31.10
Date: Mon, 14 Mar 2022 Prob (F-statistic): 5.11e-06
Time: 23:02:47 Log-Likelihood: -152.15
No. Observations: 30 AIC: 306.3
Df Residuals: 29 BIC: 307.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 4.8576 0.871 5.577 0.000 3.076 6.639
==============================================================================
Omnibus: 3.466 Durbin-Watson: 1.244
Prob(Omnibus): 0.177 Jarque-Bera (JB): 1.962
Skew: -0.371 Prob(JB): 0.375
Kurtosis: 1.990 Cond. No. 1.00
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
I need to perform multiple polynomial regression and obtain statistics such as p-values, AIC, etc.
As far as I understand, I can do that with OLS. However, I have only found a way to write a formula with one independent variable, like this:
model = 'act_hours ~ h_hours + I(h_hours**2)'
hours_model = smf.ols(formula = model, data = df)
I tried to define a formula using two independent variables, but I could not tell whether this is the correct way or whether the results are reasonable. The line I am unsure about is model = 'Height ~ Diamet + I(Diamet**2) + area + I(area**2)'. The full code is:
import pandas as pd
import statsmodels.formula.api as smf
train = pd.read_csv(r'W:\...file.csv')
model = 'Height ~ Diamet + I(Diamet**2) + area + I(area**2)'
hours_model = smf.ols(formula = model, data = train).fit()
print(hours_model.summary())
The summary of the regression is here:
OLS Regression Results
==============================================================================
Dep. Variable: Height R-squared: 0.611
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 376.0
Date: Fri, 04 Feb 2022 Prob (F-statistic): 1.33e-194
Time: 08:50:17 Log-Likelihood: -5114.6
No. Observations: 963 AIC: 1.024e+04
Df Residuals: 958 BIC: 1.026e+04
Df Model: 4
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 13.9287 60.951 0.229 0.819 -105.684 133.542
Diamet 0.6027 0.340 1.770 0.077 -0.066 1.271
I(Diamet ** 2) 0.0004 0.002 0.262 0.794 -0.003 0.004
area 3.3553 5.307 0.632 0.527 -7.060 13.771
I(area ** 2) 0.2519 0.108 2.324 0.020 0.039 0.465
==============================================================================
Omnibus: 60.996 Durbin-Watson: 1.889
Prob(Omnibus): 0.000 Jarque-Bera (JB): 86.039
Skew: 0.528 Prob(JB): 2.07e-19
Kurtosis: 4.015 Cond. No. 4.45e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.45e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
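For what it's worth, the formula looks like a valid way to write a second-order polynomial in two predictors: each I(...) term is evaluated as ordinary Python arithmetic and becomes its own column. One way to convince yourself is to compare it with a model built from explicitly squared columns. The sketch below uses synthetic data (the column names are borrowed from the question, the values and coefficients are invented), and also shows that the individual statistics are available as attributes of the fitted results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the CSV in the question; all values are invented.
rng = np.random.default_rng(1)
train = pd.DataFrame({
    "Diamet": rng.uniform(10, 100, 300),
    "area": rng.uniform(1, 20, 300),
})
train["Height"] = (5 + 0.6 * train["Diamet"] + 0.25 * train["area"] ** 2
                   + rng.normal(0, 10, 300))

# Version 1: squares written inside the formula with I(...).
fit_formula = smf.ols(
    "Height ~ Diamet + I(Diamet**2) + area + I(area**2)", data=train
).fit()

# Version 2: squares precomputed as explicit columns.
train["Diamet_sq"] = train["Diamet"] ** 2
train["area_sq"] = train["area"] ** 2
fit_manual = smf.ols(
    "Height ~ Diamet + Diamet_sq + area + area_sq", data=train
).fit()

# Both describe the same model, so the fit statistics should agree.
print(np.isclose(fit_formula.aic, fit_manual.aic))
print(fit_formula.pvalues)  # one p-value per term
print(fit_formula.aic, fit_formula.bic, fit_formula.rsquared_adj)
```

The large condition number in note [2] is expected to some degree with raw polynomial terms, since a variable and its square are strongly correlated; centering the predictors before squaring is a common way to reduce it.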
I've looked through the documentation and still can't figure this out. I want to run a WLS with multiple regressors.
statsmodels.api is imported as sm.
Example with a single variable:
X = Height
Y = Weight
res = sm.OLS(Y, X).fit()
res.summary()
Say I also have:
X2 = Age
How do I add this into my regression?
You can put them into a pandas DataFrame and select the columns (this way the output looks nicer too):
import statsmodels.api as sm
import pandas as pd
import numpy as np
Height = np.random.uniform(0,1,100)
Weight = np.random.uniform(0,1,100)
Age = np.random.uniform(0,30,100)
df = pd.DataFrame({'Height':Height,'Weight':Weight,'Age':Age})
res = sm.OLS(df['Height'],df[['Weight','Age']]).fit()
In [10]: res.summary()
Out[10]:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
=======================================================================================
Dep. Variable: Height R-squared (uncentered): 0.700
Model: OLS Adj. R-squared (uncentered): 0.694
Method: Least Squares F-statistic: 114.3
Date: Mon, 24 Aug 2020 Prob (F-statistic): 2.43e-26
Time: 15:54:30 Log-Likelihood: -28.374
No. Observations: 100 AIC: 60.75
Df Residuals: 98 BIC: 65.96
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Weight 0.1787 0.090 1.988 0.050 0.000 0.357
Age 0.0229 0.003 8.235 0.000 0.017 0.028
==============================================================================
Omnibus: 2.938 Durbin-Watson: 1.813
Prob(Omnibus): 0.230 Jarque-Bera (JB): 2.223
Skew: -0.211 Prob(JB): 0.329
Kurtosis: 2.404 Cond. No. 49.7
==============================================================================
I use a 2nd order polynomial to predict how height and age affect weight for a soldier. You can pick up ansur_2_m.csv on my GitHub.
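import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# np.integer is an abstract type; use a concrete integer dtype instead
df=pd.read_csv('ANSUR_2_M.csv', encoding = "ISO-8859-1", usecols=['Weightlbs','Heightin','Age'], dtype={'Weightlbs':int,'Heightin':int,'Age':int})
df=df.dropna()
# reset_index returns a new frame, so assign the result
df=df.reset_index(drop=True)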
df['Heightin2']=df['Heightin']**2
df['Age2']=df['Age']**2
formula="Weightlbs ~ Heightin+Heightin2+Age+Age2"
model_ols = smf.ols(formula,data=df).fit()
minHeight=df['Heightin'].min()
maxHeight=df['Heightin'].max()
avgAge = df['Age'].median()
print(minHeight,maxHeight,avgAge)
df2=pd.DataFrame()
df2['Heightin']=np.linspace(60,100,50)
df2['Heightin2']=df2['Heightin']**2
df2['Age']=28
# square the Age column of the prediction frame, not of the training data
df2['Age2']=df2['Age']**2
df3=pd.DataFrame()
df3['Heightin']=np.linspace(60,100,50)
df3['Heightin2']=df3['Heightin']**2
df3['Age']=45
df3['Age2']=df3['Age']**2
prediction28=model_ols.predict(df2)
prediction45=model_ols.predict(df3)
plt.clf()
plt.plot(df2['Heightin'],prediction28,label="Age 28")
plt.plot(df3['Heightin'],prediction45,label="Age 45")
# ylabel and xlabel are functions; assigning to them would overwrite them
plt.ylabel("Weight lbs")
plt.xlabel("Height in")
plt.legend()
plt.show()
print('A 45-year-old soldier is likely to weigh more than a 28-year-old soldier')
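An alternative that avoids keeping the squared columns in sync by hand is to put the squares into the formula with I(...); predict then only needs the raw Heightin and Age columns. A sketch on made-up data (the numbers below are synthetic, not ANSUR measurements, and the coefficients used to generate them are arbitrary):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; NOT the ANSUR measurements.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Heightin": rng.uniform(60, 80, 500),
    "Age": rng.uniform(18, 55, 500),
})
df["Weightlbs"] = (-200 + 5.5 * df["Heightin"] + 0.4 * df["Age"]
                   + rng.normal(0, 15, 500))

# Squared terms live in the formula, so no extra columns are needed.
model = smf.ols(
    "Weightlbs ~ Heightin + I(Heightin**2) + Age + I(Age**2)", data=df
).fit()

# Prediction grids only carry the raw predictors.
heights = np.linspace(60, 80, 50)
grid28 = pd.DataFrame({"Heightin": heights, "Age": 28})
grid45 = pd.DataFrame({"Heightin": heights, "Age": 45})

prediction28 = model.predict(grid28)
prediction45 = model.predict(grid45)
print(prediction28.head())
```

Because the formula machinery recomputes the squares for any new frame, there is no way to accidentally square the wrong frame's Age column.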
I am trying to run a regression on some data from a dataframe, but I keep getting this weird shape error. Any idea what is wrong?
import pandas as pd
import io
import requests
import statsmodels.api as sm
# Read in a dataset
url="https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')))
# Select feature columns
X = df[['Body', 'Clean.Cup']]
# Select dv column
y = df['Cupper.Points']
# make model
mod = sm.OLS(X, y).fit()
I get this error:
shapes (1311,2) and (1311,2) not aligned: 2 (dim 1) != 1311 (dim 0)
You have your X and y terms in the wrong order in your sm.OLS command:
import pandas as pd
import io
import requests
import statsmodels.api as sm
# Read in a dataset
url="https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv"
s=requests.get(url).content
df=pd.read_csv(io.StringIO(s.decode('utf-8')))
# Select feature columns
X = df[['Body', 'Clean.Cup']]
# Select dv column
y = df['Cupper.Points']
# make model
mod = sm.OLS(y, X).fit()
mod.summary()
runs and returns
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: Cupper.Points R-squared: 0.998
Model: OLS Adj. R-squared: 0.998
Method: Least Squares F-statistic: 3.145e+05
Date: Sat, 06 Jul 2019 Prob (F-statistic): 0.00
Time: 19:42:59 Log-Likelihood: -454.94
No. Observations: 1311 AIC: 913.9
Df Residuals: 1309 BIC: 924.2
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Body 0.8464 0.016 53.188 0.000 0.815 0.878
Clean.Cup 0.1154 0.012 9.502 0.000 0.092 0.139
==============================================================================
Omnibus: 537.879 Durbin-Watson: 1.710
Prob(Omnibus): 0.000 Jarque-Bera (JB): 30220.027
Skew: 1.094 Prob(JB): 0.00
Kurtosis: 26.419 Cond. No. 26.2
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
The order of y and X is wrong. The dependent variable goes first:
sm.OLS(y,X)
I have applied logistic regression and I want to store the p-values and t-values in another dataframe.
I have a dataframe that looks like:
Algorithm Success
A 0.91
B 0.98
C 0.76
.
.
.
B 0.77
C 0.68
D 0.43
Code:
p1_logit_model=sm.MNLogit(group["Algorithm"], group["Success"].astype(float))
Output:
Results: MNLogit
===============================================================
Model: MNLogit Pseudo R-squared: 0.104
Dependent Variable: algorithm AIC: 184.2255
Date: 2018-12-18 17:19 BIC: 194.2622
No. Observations: 55 Log-Likelihood: -87.113
Df Model: 0 LL-Null: -97.227
Df Residuals: 50 LLR p-value: nan
Converged: 1.0000 Scale: 1.0000
No. Iterations: 9.0000
--------------------------------------------------------------
algorithm = 0 Coef. Std.Err. t P>|t| [0.025 0.975]
--------------------------------------------------------------
p1_less100ms 0.2326 0.5804 0.4008 0.6886 -0.9050 1.3702
--------------------------------------------------------------
algorithm = 1 Coef. Std.Err. t P>|t| [0.025 0.975]
--------------------------------------------------------------
p1_less100ms -6.3891 3.9519 -1.6167 0.1059 -14.1346 1.3565
I want to store the p-value and t-score for each algorithm in a dataframe. Can anyone help me with how to do this?
I think you need to fit the model first to access the p-values and t-values. Try this:
fit = p1_logit_model.fit()
print(fit.pvalues[i])
print(fit.tvalues[i])
where i is the index of whichever category you want to inspect in the multinomial model. As a tip, if what you really want is a binary logistic regression, use model = sm.Logit(y, X) instead.