Getting standard errors from regressions using rpy2 - python

I am using rpy2 for regressions. The returned object is a list that includes coefficients, residuals, fitted values, the rank of the fitted model, etc.
However, I can't find the standard errors (nor the R^2) in the fit object. Running lm directly in R, the standard errors are displayed with the summary command, but I can't access them directly in the model's data frame.
How can I extract this info using rpy2?
Sample Python code:
from scipy import random
from numpy import hstack, array, matrix
from rpy2 import robjects
from rpy2.robjects.packages import importr

def test_regress():
    stats = importr('stats')
    x = random.uniform(0, 1, 100).reshape([100, 1])
    y = 1 + x + random.uniform(0, 1, 100).reshape([100, 1])
    x_in_r = create_r_matrix(x, x.shape[1])
    y_in_r = create_r_matrix(y, y.shape[1])
    formula = robjects.Formula('y~x')
    env = formula.environment
    env['x'] = x_in_r
    env['y'] = y_in_r
    fit = stats.lm(formula)
    coeffs = array(fit[0])
    resids = array(fit[1])
    fitted_vals = array(fit[4])
    return (coeffs, resids, fitted_vals)

def create_r_matrix(py_array, ncols):
    if type(py_array) == type(matrix([1])) or type(py_array) == type(array([1])):
        py_array = py_array.tolist()
    r_vector = robjects.FloatVector(flatten_list(py_array))
    r_matrix = robjects.r['matrix'](r_vector, ncol=ncols)
    return r_matrix

def flatten_list(source):
    return [item for sublist in source for item in sublist]

test_regress()

So this seems to work for me:
base = importr('base')  # provides summary(), used below

def test_regress():
    stats = importr('stats')
    x = random.uniform(0, 1, 100).reshape([100, 1])
    y = 1 + x + random.uniform(0, 1, 100).reshape([100, 1])
    x_in_r = create_r_matrix(x, x.shape[1])
    y_in_r = create_r_matrix(y, y.shape[1])
    formula = robjects.Formula('y~x')
    env = formula.environment
    env['x'] = x_in_r
    env['y'] = y_in_r
    fit = stats.lm(formula)
    coeffs = array(fit[0])
    resids = array(fit[1])
    fitted_vals = array(fit[4])
    modsum = base.summary(fit)
    rsquared = array(modsum[7])
    se = array(modsum.rx2('coefficients')[2:4])
    return (coeffs, resids, fitted_vals, rsquared, se)
Although, as I said, this is literally my first foray into RPy2, so there may be a better way to do that. But this version appears to output arrays containing the R-squared value along with the standard errors.
You can use print(modsum.names) to see the names of the components of the R object (kind of like names(modsum) in R); .rx and .rx2 are then the equivalents of [ and [[ in R.
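For example, a quick way to poke around the summary object interactively (a sketch, assuming fit and base from the snippet above):

modsum = base.summary(fit)
print(modsum.names)                      # like names(modsum) in R
rsq = modsum.rx2('r.squared')            # [[-style extraction by name
coef_table = modsum.rx2('coefficients')  # estimates, std. errors, t- and p-values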

@joran: Pretty good. I'd say that it is pretty much the way to do it.
from rpy2 import robjects
from rpy2.robjects.packages import importr

base = importr('base')
stats = importr('stats')  # import only once!

def test_regress():
    x = base.matrix(stats.runif(100), nrow=100)
    y = (x.ro + base.matrix(stats.runif(100), nrow=100)).ro + 1  # not so nice
    formula = robjects.Formula('y~x')
    env = formula.environment
    env['x'] = x
    env['y'] = y
    fit = stats.lm(formula)
    coefs = stats.coef(fit)
    resids = stats.residuals(fit)
    fitted_vals = stats.fitted(fit)
    modsum = base.summary(fit)
    rsquared = modsum.rx2('r.squared')
    se = modsum.rx2('coefficients')[2:4]
    return (coefs, resids, fitted_vals, rsquared, se)
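The values returned this way are still R vectors; if numpy arrays are more convenient downstream, they convert directly (a minimal sketch, assuming test_regress above has been defined):

import numpy as np

coefs, resids, fitted_vals, rsquared, se = test_regress()
print(np.array(coefs))      # intercept and slope as a numpy array
print(np.array(rsquared))   # one-element array holding R^2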

Related

Unable to move R analyses output back to Python (rpy2)

I am attempting to pass some data from Python to R and then return the results to Python, but I can't seem to get it to work.
I am successful in passing my data to R, running my custom function on the data, and even getting the output. Where I am stuck is getting the statistical output back into Python as a dataframe. I have tried using rpy2 and even exporting to a .csv file to re-import, but can't get either method to work. When I try to push it back to pandas, I get an error that it can't be coerced. When it comes to saving to a .csv, I can't seem to get it to work using my "results" object. From my reading, it seems that checking what is in the R global environment may help me figure it out, but I haven't been able to figure out how to do that either.
Any helpful comments are appreciated.
# import statements
import rpy2
print(rpy2.__version__)
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()

base = importr('base')
utils = importr('utils')

name = 'test_subject'

# Sample data to analyze
list1 = [0,1,2,3,4,5,6,7,8,9,10]  # analysis window
list2 = [1,5,6,8,7,9,10,8,7,6,3]  # number of responses per bin

# Convert data to R objects
set1 = robjects.IntVector(list1)
set2 = robjects.IntVector(list2)
makeDataFrame = robjects.r('''data.frame''')
df = makeDataFrame(x=set1, y=set2)

# Create curve fitting function
curve_fit = robjects.r('''
curve_fit <- function(df, plot = FALSE){
    control <- nls.control(maxiter = 1000, tol = 0.000100, minFactor = 1/2064,
                           printEval = FALSE, warnOnly = TRUE)
    fit <- nls(y ~ d + a*exp(-.5*((x - t0)/b)^2) + c*(x - t0),
               data = df,
               start = list(a = 1, b = 10, t0 = 10, c = 1, d = 1),
               algorithm = "port",
               control = control)
    if (plot){
        fitFnc <- function(x) predict(fit, list(x = x))
        par(mfrow = c(1, 1))
        plot(df$x, df$y, xlim = c(0, 45))
        curve(fitFnc, from = .5, to = 45, add = TRUE)
    }
    return(list("params" = summary(fit),
                "r2" = cor(predict(fit), df$y)^2))
}''')

# run function on data
results = curve_fit(df, plot=True)

# Show results
print('results', results)
print(type(results))
The problem was in
return(list("params" = summary(fit), "r2" = cor(predict(fit), df$y)^2))
The first item in the list, "params", was a summary table from R. While this printed in Python as the data I wanted, it was a single object that could not be subdivided, as it was essentially an image of an R output table. What I needed to return was a data frame, as shown in the code below.
return(data.frame(coef(summary(fit)), r2 = cor(predict(fit), df$y)^2))
This returned a list of objects that I could then convert to a numpy array and manipulate in Python.
Here is the full code.
# import statements
import rpy2
print(rpy2.__version__)
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
import numpy as np
rpy2.robjects.numpy2ri.activate()

base = importr('base')
utils = importr('utils')

# Sample data to analyze
list1 = [0,1,2,3,4,5,6,7,8,9,10]  # analysis window
list2 = [1,5,6,8,7,9,10,8,7,6,3]  # number of responses per bin

# Convert data to R objects and place in data frame
set1 = robjects.IntVector(list1)
set2 = robjects.IntVector(list2)
makeDataFrame = robjects.r('''data.frame''')
df = makeDataFrame(x=set1, y=set2)

# Create curve fitting function in R
curve_fit = robjects.r('''
# Fit function
curve_fit <- function(df, plot = FALSE){
    control <- nls.control(maxiter = 1000, tol = 0.000100, minFactor = 1/2064,
                           printEval = FALSE, warnOnly = TRUE)
    # Specify formula to fit
    fit <- nls(y ~ d + a*exp(-.5*((x - t0)/b)^2) + c*(x - t0),
               data = df,
               start = list(a = 1, b = 10, t0 = 10, c = 1, d = 1),
               algorithm = "port",
               control = control)
    # Create plot of curve
    if (plot){
        fitFnc <- function(x) predict(fit, list(x = x))
        par(mfrow = c(1, 1))
        plot(df$x, df$y, xlim = c(0, 45))
        curve(fitFnc, from = .5, to = 45, add = TRUE)
    }
    # Return data in an R data.frame
    return(data.frame(coef(summary(fit)), r2 = cor(predict(fit), df$y)^2))
}''')

# run function on data
results = curve_fit(df, plot=True)
results = np.array(results)  # convert to numpy array

# Show results
print('results', results)
print(type(results))

Non-Linearity Test NaN error

I wanted to try statsmodels' linear_harvey_collier test with a simple example. However, I get nan as a result. Can you see where my error lies?
import numpy as np
import statsmodels.stats.api as sms  # provides linear_harvey_collier
from statsmodels.regression.linear_model import OLS

np.random.seed(44)
n_samples, n_features = 50, 4
X = np.random.randn(n_samples, n_features)
coef = np.random.uniform(-12, 12, 4)
y = np.dot(X, coef)
var = 400
y += var**(1/2) * np.random.normal(size=n_samples)

regr = OLS(y, X).fit()
print(regr.params)
print(regr.summary())
sms.linear_harvey_collier(regr)
I get the result Ttest_1sampResult(statistic=nan, pvalue=nan).
If I perform the test while excluding one variable, I get a result:
X3 = X[:, :3]
regr3 = OLS(y, X3).fit()

In [2]: sms.linear_harvey_collier(regr3)
Out[2]: Ttest_1sampResult(statistic=0.2447803429683807, pvalue=0.806727747845282)
Is there a problem with not adding a constant/intercept? This is just a hunch, and if there is indeed a problem, I don't understand why.
There is a bug in linear_harvey_collier that hard-codes the number of initial observations to 3.
https://github.com/statsmodels/statsmodels/pull/6727
linear_harvey_collier has only two lines of code.
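For context, its body is essentially just a recursive-residuals computation followed by a t-test, something like this (a paraphrase, not the exact statsmodels source):

# paraphrase of linear_harvey_collier; note the hard-coded 3
def linear_harvey_collier(res):
    rr = recursive_olsresiduals(res, skip=3, alpha=0.95)
    return stats.ttest_1samp(rr[3][3:], 0)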
A workaround is to compute the test directly:

import statsmodels.stats.api as sms
from scipy import stats

res = regr
skip = len(res.params)  # skip the first k observations; works around the hard-coded 3
rr = sms.recursive_olsresiduals(res, skip=skip, alpha=0.95, order_by=None)
stats.ttest_1samp(rr[3][skip:], 0)
Ttest_1sampResult(statistic=0.03092937323130299, pvalue=0.9754626388210277)

Use Python lmfit with a variable number of parameters in function

I am trying to deconvolve complex gas chromatogram signals into individual gaussian signals. Here is an example, where the dotted line represents the signal I am trying to deconvolve.
I was able to write the code to do this using scipy.optimize.curve_fit; however, once applied to real data the results were unreliable. I believe being able to set bounds to my parameters will improve my results, so I am attempting to use lmfit, which allows this. I am having a problem getting lmfit to work with a variable number of parameters. The signals I am working with may have an arbitrary number of underlying gaussian components, so the number of parameters I need will vary. I found some hints here, but still can't figure it out...
Creating a python lmfit Model with arbitrary number of parameters
Here is the code I am currently working with. The code will run, but the parameter estimates do not change when the model is fit. Does anyone know how I can get my model to work?
import numpy as np
from collections import OrderedDict
from scipy.stats import norm
from lmfit import Parameters, Model

def add_peaks(x_range, *pars):
    y = np.zeros(len(x_range))
    for i in np.arange(0, len(pars), 3):
        curve = norm.pdf(x_range, pars[i], pars[i+1]) * pars[i+2]
        y = y + curve
    return y

# generate some fake data
x_range = np.linspace(0, 100, 1000)
peaks = [50., 40., 60.]
a = norm.pdf(x_range, peaks[0], 5) * 2
b = norm.pdf(x_range, peaks[1], 1) * 0.1
c = norm.pdf(x_range, peaks[2], 1) * 0.1
fake = a + b + c

param_dict = OrderedDict()
for i in range(0, len(peaks)):
    param_dict['pk' + str(i)] = peaks[i]
    param_dict['wid' + str(i)] = 1.
    param_dict['mult' + str(i)] = 1.

# In case you'd like to see the plot of the fake data:
# y = add_peaks(x_range, *param_dict.values())
# plt.plot(x_range, y)
# plt.show()

# Initialize the model and fit
pmodel = Model(add_peaks)
params = pmodel.make_params()
for i in param_dict.keys():
    params.add(i, value=param_dict[i])

result = pmodel.fit(fake, params=params, x_range=x_range)
print(result.fit_report())
I think you would be better off using lmfit's ability to build a composite model.
That is, with a single peak defined with
from scipy.stats import norm

def peak(x, amp, center, sigma):
    return amp * norm.pdf(x, center, sigma)
(see also lmfit.models.GaussianModel), you can build a model with many peaks:
npeaks = 3
model = Model(peak, prefix='p1_')
for i in range(1, npeaks):
    model = model + Model(peak, prefix='p%d_' % (i+1))

params = model.make_params()
Now model will be a sum of 3 Gaussian functions, and the params created for that model will have names like p1_amp, p1_center, p2_amp, ..., to which you can attach sensible initial values and/or bounds and/or constraints.
Given your example data, you could pass initial values to make_params like this:

params = model.make_params(p1_amp=2.0, p1_center=50., p1_sigma=2,
                           p2_amp=0.2, p2_center=40., p2_sigma=2,
                           p3_amp=0.2, p3_center=60., p3_sigma=2)
result = model.fit(fake, params, x=x_range)
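Since the original motivation was bounding the parameters, note that each Parameter can also be constrained individually; a small sketch (the bound values are made up for illustration):

# keep the first peak's center near 50 and its width and amplitude positive
params['p1_center'].set(value=50, min=40, max=60)
params['p1_sigma'].set(value=2, min=0)
params['p1_amp'].set(value=2.0, min=0)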
I was able to find a solution here:
https://lmfit.github.io/lmfit-py/builtin_models.html#example-3-fitting-multiple-peaks-and-using-prefixes
Building on the code above, the following accomplishes what I was trying to do...
import matplotlib.pyplot as plt
from lmfit.models import GaussianModel

# reuses peaks, fake, and x_range from the code above
gauss1 = GaussianModel(prefix='g1_')
gauss2 = GaussianModel(prefix='g2_')
gauss3 = GaussianModel(prefix='g3_')
gauss4 = GaussianModel(prefix='g4_')
gauss5 = GaussianModel(prefix='g5_')

gauss = [gauss1, gauss2, gauss3, gauss4, gauss5]
prefixes = ['g1_', 'g2_', 'g3_', 'g4_', 'g5_']

mod = np.sum(gauss[0:len(peaks)])
pars = mod.make_params()
for i, prefix in zip(range(0, len(peaks)), prefixes[0:len(peaks)]):
    pars[prefix + 'center'].set(peaks[i])

init = mod.eval(pars, x=x_range)
out = mod.fit(fake, pars, x=x_range)
print(out.fit_report(min_correl=0.5))
out.plot_fit()
plt.show()

How to specify size for bernoulli distribution with pymc3?

In trying to make my way through Bayesian Methods for Hackers, which is in pymc, I came across this code:
first_coin_flips = pm.Bernoulli("first_flips", 0.5, size=N)
I've tried to translate this to pymc3 with the following, but it just returns a numpy array, rather than a tensor (?):
first_coin_flips = pm.Bernoulli("first_flips", 0.5).random(size=50)
The reason the size matters is that it's used later on in a deterministic variable. Here's the entirety of the code that I have so far:
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
import mpld3
import theano.tensor as tt

model = pm.Model()
with model:
    N = 100
    p = pm.Uniform("cheating_freq", 0, 1)
    true_answers = pm.Bernoulli("truths", p)
    print(true_answers)
    first_coin_flips = pm.Bernoulli("first_flips", 0.5)
    second_coin_flips = pm.Bernoulli("second_flips", 0.5)
    # print(first_coin_flips.value)

    # Create model variables
    def calc_p(true_answers, first_coin_flips, second_coin_flips):
        observed = first_coin_flips * true_answers + (1 - first_coin_flips) * second_coin_flips
        # NOTE: Where I think the size param matters, since we're dividing by it
        return observed.sum() / float(N)

    calced_p = pm.Deterministic("observed", calc_p(true_answers, first_coin_flips, second_coin_flips))
    step = pm.Metropolis(model.free_RVs)
    trace = pm.sample(1000, tune=500, step=step)
    pm.traceplot(trace)

html = mpld3.fig_to_html(plt.gcf())
with open("output.html", 'w') as f:
    f.write(html)
    f.close()
And the output (the traceplot image is omitted here): the coin-flip and uniform cheating_freq traces look correct, but the observed trace doesn't look like anything to me, and I think it's because I'm not translating that size param correctly.
The pymc3 way to specify the size of a Bernoulli distribution is by using the shape parameter, like:
first_coin_flips = pm.Bernoulli("first_flips", 0.5, shape=N)
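For reference, a minimal sketch of how the question's model might read with shape (untested against any particular pymc3 version; names follow the question):

with pm.Model() as model:
    N = 100
    p = pm.Uniform("cheating_freq", 0, 1)
    true_answers = pm.Bernoulli("truths", p, shape=N)
    first_coin_flips = pm.Bernoulli("first_flips", 0.5, shape=N)
    second_coin_flips = pm.Bernoulli("second_flips", 0.5, shape=N)
    # elementwise over the N flips, then averaged, so dividing by N now works
    observed = first_coin_flips * true_answers + (1 - first_coin_flips) * second_coin_flips
    calced_p = pm.Deterministic("observed", observed.sum() / N)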

How to implement simple Monte Carlo function in pymc

I'm trying to get my head around how to implement a Monte Carlo function in Python using pymc, to replicate a spreadsheet by Douglas Hubbard in his book How to Measure Anything.
My attempt was:
import numpy as np
import pandas as pd
from pymc import DiscreteUniform, Exponential, deterministic, Poisson, Uniform, Normal, Stochastic, MCMC, Model

maintenance_saving_range = DiscreteUniform('maintenance_saving_range', lower=10, upper=21)
labour_saving_range = DiscreteUniform('labour_saving_range', lower=-2, upper=9)
raw_material_range = DiscreteUniform('maintenance_saving_range', lower=3, upper=10)
production_level_range = DiscreteUniform('maintenance_saving_range', lower=15000, upper=35000)

@deterministic(plot=False)
def rate(m=maintenance_saving_range, l=labour_saving_range, r=raw_material_range, p=production_level_range):
    return (m + l + r) * p

model = Model([rate, maintenance_saving_range, labour_saving_range, raw_material_range, production_level_range])
mc = MCMC(model)
Unfortunately, I'm getting an error: ValueError: A tallyable PyMC object called maintenance_saving_range already exists. This will cause problems for some database backends.
What have I got wrong?
Ah, it was a copy and paste error.
I'd called three distributions by the same name.
Here's the code that works.
import numpy as np
import pandas as pd
from pymc import DiscreteUniform, Exponential, deterministic, Poisson, Uniform, Normal, Stochastic, MCMC, Model
%matplotlib inline
import matplotlib.pyplot as plt

maintenance_saving_range = DiscreteUniform('maintenance_saving_range', lower=10, upper=21)
labour_saving_range = DiscreteUniform('labour_saving_range', lower=-2, upper=9)
raw_material_range = DiscreteUniform('raw_material_range', lower=3, upper=10)
production_level_range = DiscreteUniform('production_level_range', lower=15000, upper=35000)

@deterministic(plot=False, name="rate")
def rate(m=maintenance_saving_range, l=labour_saving_range, r=raw_material_range, p=production_level_range):
    # out = np.empty(10000)
    out = (m + l + r) * p
    return out

model = Model([rate, maintenance_saving_range, labour_saving_range, raw_material_range])
mc = MCMC(model)
mc.sample(iter=10000)
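To actually look at the resulting Monte Carlo distribution, the samples can be pulled out of the trace afterwards; a brief sketch (plotting assumes the matplotlib import above):

rate_samples = mc.trace('rate')[:]  # array of the 10000 sampled rates
print(rate_samples.mean(), rate_samples.std())
plt.hist(rate_samples, bins=50)     # histogram of the simulated savings rate
plt.show()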
