Estimating regression parameter confidence intervals with scikits bootstrap

I've just started to try out a nice bootstrapping package available through scikits:
https://github.com/cgevans/scikits-bootstrap
but I've encountered a problem when trying to estimate confidence intervals for the correlation coefficient from linear regression. The confidence intervals returned lie completely outside the range of the original statistic.
Here is the code:
import numpy as np
from scipy import stats
import bootstrap as boot
np.random.seed(0)
x = np.arange(10)
y = 10 + 1.5*x + 2*np.random.randn(10)
r0 = stats.linregress(x, y)[2]
def my_function(y):
    return stats.linregress(x, y)[2]
ci = boot.ci(y, statfunction=my_function, alpha=0.05, n_samples=1000, method='pi')
This yields a result of ci = [-0.605, 0.644], but the original statistic is r0=0.894.
I've tried this in R and it seems to work fine there: the ci straddles r0 as expected.
Please help!

Could you provide your R code? I'd be interested in knowing how this is dealt with in R.
The problem here is that you're only passing y to boot.ci, but every time it runs my_function, it uses the entire, original x (note the lack of x input to my_function). Bootstrapping applies the statistic function to resampled data, so if you're applying your statistic function using the original x and a sample of y, you're going to have a nonsensical result. This is why the BCA method doesn't work at all, actually: it can't apply your statistic function to jackknife samples, which don't have the same number of elements.
Samples are taken along axis 0 (rows), so if you want to pass multiple 1D arrays to your statistic function, you can stack them as columns: xy = np.vstack((x, y)).T would work, and then use a statfunction that takes data from those columns:
def my_function(xysample):
    return stats.linregress(xysample[:, 0], xysample[:, 1])[2]
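To make the column-stacking route concrete, here is a minimal, self-contained sketch that reuses the question's data and import; the method='bca' choice follows the recommendation further below, and everything else mirrors the snippets above:

import numpy as np
from scipy import stats
import bootstrap as boot

np.random.seed(0)
x = np.arange(10)
y = 10 + 1.5*x + 2*np.random.randn(10)

# Stack x and y as columns so resampling along axis 0 keeps (x, y) pairs together
xy = np.vstack((x, y)).T

def my_function(xysample):
    return stats.linregress(xysample[:, 0], xysample[:, 1])[2]

ci = boot.ci(xy, statfunction=my_function, alpha=0.05, n_samples=1000, method='bca')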
Alternatively, if you wanted to avoid messing with your data at all, you could define a function that operates on indexes, and then just pass indexes to boot.ci:
def my_function2(i):
    return stats.linregress(x[i], y[i])[2]
boot.ci(np.arange(len(x)), statfunction=my_function2, alpha=0.05, n_samples=1000, method='pi')
Note that in either of these cases, BCA works, so you may as well use method='bca' unless you really do want to use percentile intervals; BCA is pretty much always better.
I do realize that both of these methods are less than ideal. Honestly, I've never had a need to pass multiple arrays like this to my statfunction, and the majority of people are likely using mean as their statfunction. I think the best idea here may be to allow lists of arrays with equal length along axis 0 to be passed, e.g. boot.ci([x,y],...), and then sample all of those at the same time and pass them all to the statfunction as separate arguments. In that case, you could just have a my_function(x,y). I'll see if I can do this, but if you can show me your R code, that would be great, as I'd like to see if there is a better way of dealing with this.
Update:
In the most recent version of scikits.bootstrap (v0.3.1), a tuple of arrays can be provided, and samples from them will be passed as separate arguments to statfunction. Additionally, statfunction can provide array output, and confidence intervals will be calculated for each point in the output. Thus, this is now very easy to do. The following will give confidence intervals for every output of linregress:
cis = boot.ci( (x,y), statfunction=stats.linregress )
cis[:,2] in this case will be the desired confidence interval.

Related

Exclude/Ignore data region in polynomial fit (zfit)

I wanted to know if there's a way to exclude one or more data regions in a polynomial fit. Currently this doesn't seem to work as I would expect. Here a small example:
import numpy as np
import pandas as pd
import zfit
# Create test data
left_data = np.random.uniform(0, 3, size=1000).tolist()
mid_data = np.random.uniform(3, 6, size=5000).tolist()
right_data = np.random.uniform(6, 9, size=1000).tolist()
testsample = pd.DataFrame(left_data + mid_data + right_data, columns=["x"])
# Define fit parameter
coeff1 = zfit.Parameter('coeff1', 0.1, -3, 3)
coeff2 = zfit.Parameter('coeff2', 0.1, -3, 3)
# Define Space for the fit
obs_all = zfit.Space("x", limits=(0, 9))
# Define the background PDF
bkg_fit = zfit.pdf.Chebyshev(obs=obs_all, coeffs=[coeff1, coeff2], coeff0=1)
new_testsample = zfit.Data.from_pandas(obs=obs_all, df=testsample.query("x<3 or x>6"), weights=None)
# Perform the fit
nll = zfit.loss.UnbinnedNLL(model=bkg_fit, data=new_testsample)
minimizer = zfit.minimize.Minuit()
result = minimizer.minimize(nll)
[Figure: TestSample.png]
Here I've created a small test sample from three uniform distributions. I only want to use the data in x < 3 OR x > 6 and ignore the 'peak' in between. Because of their equal shape and height, I'd expect coeff1 and coeff2 to end up at (nearly) zero and the fitted curve to be a straight, horizontal line. Obviously this doesn't happen, because zfit assumes that there are simply no entries between 3 and 6.
I also tried using MultiSpaces to ignore that region via
limit1 = zfit.Space("x", limits=(0, 3))
limit2 = zfit.Space("x", limits=(6, 9))
obs_data = limit1 + limit2
But this leads to a
ValueError: obs need to be a Space with exactly one limit if rescaling is requested.
Does anyone have an idea how to solve this?
Thanks in advance ^^
Indeed, this is a bit of a tricky problem, but it may just need a small update in zfit.
What you are doing is correct: simply use only the data in the desired region. However, this is not the whole story, because there is a "normalization range": probabilistically speaking, it's like conditioning on a certain region, since we know the data can only lie in that region. Hence the normalization of the PDF should only integrate over the included (lower and upper) regions.
This can normally be done in two ways:
Using multispace
Using the multispace property as you do should work (though it is most probably not the way to go in the future), except for a quirk in the polynomial PDFs: the polynomials are defined on the interval from -1 to 1. Currently, the data is therefore simply rescaled to lie within -1 and 1 (and for that it uses the "space" property of the PDF). This currently requires a simple space (a multispace could in principle also be allowed, using the minimum and maximum of its limits).
Simultaneous fit
As mentioned in the comments by @jtlz2, you can do a simultaneous fit. That is nothing to worry about: it simply splits the likelihood into two parts. Since the likelihood is a product of probabilities, we can conceptually split it into two products and multiply them (or add their logs).
So you can have the pdf fit the lower region and the upper at the same time. However, this does not solve the problem of the normalization: what should the PDF be normalized to? We will run into the same problem.
Solution 1: different space and norm
The space and the normalization range are, however, not the same thing. By default, the space (usually called 'obs') is also used as the normalization range, but that is not required. So you could use a single space going from the lowest to the highest point as the obs, and then set the normalization range to your multispace (set_norm should do it, or set_norm_range if you're not using the newest version). This, I think, should do the trick.
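A rough sketch of Solution 1 with the objects from the question; whether the method is spelled set_norm or set_norm_range depends on your zfit version, so treat the exact call below as an assumption to check against your installation:

# One simple space spanning the full data range, used as obs
obs_all = zfit.Space("x", limits=(0, 9))

# Multispace covering only the sidebands, used as the normalization range
limit1 = zfit.Space("x", limits=(0, 3))
limit2 = zfit.Space("x", limits=(6, 9))
norm_range = limit1 + limit2

bkg_fit = zfit.pdf.Chebyshev(obs=obs_all, coeffs=[coeff1, coeff2], coeff0=1)
bkg_fit.set_norm(norm_range)   # older versions: bkg_fit.set_norm_range(norm_range)

new_testsample = zfit.Data.from_pandas(obs=obs_all, df=testsample.query("x < 3 or x > 6"))
nll = zfit.loss.UnbinnedNLL(model=bkg_fit, data=new_testsample)
result = zfit.minimize.Minuit().minimize(nll)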
Solution 2: manual re-scaling
The actual problem is that zfit complains about the rescaling to -1 and 1, which can't be done for a multispace. Every polynomial PDF that does this rescaling can also be told not to, via the apply_scaling=False argument. With that, you are responsible for scaling the data to lie within -1 and 1 (as the polynomials are not defined outside that range), and there should not be any error.
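And a sketch of Solution 2, assuming apply_scaling=False is accepted by the Chebyshev constructor as described above; the data and the multispace limits are both mapped from [0, 9] to [-1, 1] by hand:

# Map x from [0, 9] to [-1, 1]: x_scaled = (x - 4.5) / 4.5
scaled_df = testsample.query("x < 3 or x > 6").copy()
scaled_df["x"] = (scaled_df["x"] - 4.5) / 4.5

# The excluded region (3, 6) maps to (-1/3, 1/3)
limit1_s = zfit.Space("x", limits=(-1, -1/3))
limit2_s = zfit.Space("x", limits=(1/3, 1))
obs_scaled = limit1_s + limit2_s

bkg_fit = zfit.pdf.Chebyshev(obs=obs_scaled, coeffs=[coeff1, coeff2], coeff0=1,
                             apply_scaling=False)  # argument name taken from the answer above
data_scaled = zfit.Data.from_pandas(obs=obs_scaled, df=scaled_df)
nll = zfit.loss.UnbinnedNLL(model=bkg_fit, data=data_scaled)
result = zfit.minimize.Minuit().minimize(nll)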

How do I improve a Gaussian/Normal fit in Python 3.X by using a running median?

I have an array of 100x100 data points, where I'm trying to perform a Gaussian fit to each column of 100 values in the array. I then want the parameters of the Gaussian found by using the fit of the first column to be the initial parameters of the starting point for the next column to use. Let's say I start with the initial parameters of 1000, 0, and 1, and the fit finds values of 800, 3, and 1.5. I then want the fitter to use these three parameters as initial values for the next column.
My code is:
x = np.linspace(-50, 50, 100)
Gauss_Model = models.Gaussian1D(amplitude=1000., mean=0, stddev=1.)
Fitting_Model = fitting.LevMarLSQFitter()
Fit_Data = []
for i in range(0, Data_Array.shape[0]):
    Fit_Data.append(Fitting_Model(Gauss_Model, x, Data_Array[:, i]))
Right now it uses the same initial values for every fit. Does anyone know how to perform such a running median/mean for a Gaussian fitting method? Would really appreciate any help or being pointed in the right direction, thanks!
I'm not familiar with the specific library you are using, but if you can get your fitted parameters out with something like fit_data[-1].amplitude or fit_data[-1].mean, then you could modify your loop to use something like:
for i in range(0, data_array.shape[0]):
    if fit_data:  # true if not an empty list
        Gauss_Model = models.Gaussian1D(amplitude=fit_data[-1].amplitude,
                                        mean=fit_data[-1].mean,
                                        stddev=fit_data[-1].stddev)
    fit_data.append(Fitting_Model(Gauss_Model, x, Data_Array[:, i]))
basically checking whether you have already fit a model, and if you have, use the most recent fitted amplitude, mean, and standard deviation as the starting point for your next Gauss_Model.
A thought: this might speed up your fitting, but it shouldn't result in a "better" fit to the 100 data points in each fit operation. Your resulting model is probably the best-fit model to the data it was presented with. If you want to estimate the error in the parameters of your model, you can use the fact that, for two independent normal distributions A ~ N(m_a, v_a) and B ~ N(m_b, v_b), the sum A + B is distributed as N(m_a + m_b, v_a + v_b). Consequently, the average of your n fitted means is distributed as N(sum(means)/n, sum(variances)/n^2): you can take the mean of your means as the overall estimate, with standard deviation sqrt(sum(variances))/n (roughly stddev/sqrt(n) when the individual standard deviations are comparable).
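As a small numerical illustration of that combination rule (the per-column numbers here are made up, not taken from any fit above):

import numpy as np

# Hypothetical fitted means and standard deviations from n independent column fits
means = np.array([3.1, 2.9, 3.0, 3.2])
stddevs = np.array([1.5, 1.4, 1.6, 1.5])
n = len(means)

combined_mean = means.sum() / n                 # mean of the fitted means
combined_std = np.sqrt((stddevs**2).sum()) / n  # sqrt(sum of variances) / n
print(combined_mean, combined_std)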
I also cannot tell what library you are using, and the details of how to do this probably depend on the details of how that library stores the fitted values. I can say that for lmfit (https://lmfit.github.io/lmfit-py/) we struggled with this sort of usage and arrived at a design that makes what you are trying to do pretty easy. With lmfit, you might compose this problem as:
import numpy as np
from lmfit.models import GaussianModel

x = np.linspace(-50, 50, 100)
# get Data_Array from somewhere....

# create a model for a Gaussian
Gauss_Model = GaussianModel()

# make a set of parameters, setting initial values
params = Gauss_Model.make_params(amplitude=1000, center=0, sigma=1.0)

Fit_Results = []
for i in range(Data_Array.shape[1]):
    result = Gauss_Model.fit(Data_Array[:, i], params, x=x)
    Fit_Results.append(result)
    # update `params` with the current best-fit params for the next column
    params = result.params
Note that this works because lmfit is careful that Model.fit() will not alter the input parameters, and will put the resulting best-fit parameters for each fit in result.params.
And, if you decide you do want to have all columns use the original initial values, just comment out that last params = result.params.
Lmfit has a lot more bells and whistles, but I hope that helps you do what you need.

Exponential fit with the least squares Python

I have a very specific task, where I need to find the slope of my exponential function.
I have two arrays, one denoting the wavelength range between 400 and 750 nm, the other the absorption spectrum. x = wavelengths, y = absorption.
My fit function should look something like this:
y_mod = np.float(a_440) * np.exp(-S*(x - 440.))
where S is the slope and in the image equals 0.016, which should be in the range of S values I should get (+/- 0.003). a_440 is the reference absorption at 440 nm, x is the wavelength.
[Figure: modelled vs. original plot]
I would like to know how to define my function in order to get an exponential fit (not on log transformed quantities) of it without guessing beforehand what the S value is.
What I've tried so far was to define the function in such way:
def func(x, a, b):
    return a * np.exp(-b * (x - 440))
And it gives pretty nice matches (see the fitted vs. original figure).
What I'm not sure is whether this approach is correct or should I do it differently?
How would one also use least squares or absolute differences in y for the minimization, in order to remove the effect of outliers?
Is it possible to also add random noise to the data and recompute the fit?
Your situation is the same as the one described in the documentation for scipy's curve_fit.
The problem you're running into is that your definition of the function accepts only one argument, when it should receive three: x (the independent variable where the function is evaluated), plus a_440 and S.
Cleaning it up a bit, the function should look more like this:
def func(x, A, S):
    return A * np.exp(-S * (x - 440.))
You might run into a warning about the covariance matrix. You can solve that by providing a decent starting point to curve_fit through the argument p0, which takes a list. For example, in this case p0=[1, 0.01], and the fitting call would look like the following:
curve_fit(func, x, y, p0=[1,0.01])
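Putting the pieces together, here is a minimal sketch of the whole fit; the spectrum is synthetic (generated with S = 0.016 plus noise), since the original arrays aren't shown, and the parameter errors are read off the covariance matrix:

import numpy as np
from scipy.optimize import curve_fit

def func(x, A, S):
    return A * np.exp(-S * (x - 440.))

# Synthetic stand-in for the measured spectrum
x = np.linspace(400, 750, 200)
y = func(x, 1.0, 0.016) + np.random.normal(0, 0.01, x.size)

popt, pcov = curve_fit(func, x, y, p0=[1, 0.01])
A_fit, S_fit = popt
A_err, S_err = np.sqrt(np.diag(pcov))   # 1-sigma uncertainties
print(S_fit, S_err)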

Fitting Parametric Curves in Python

I have experimental data of the form (X,Y) and a theoretical model of the form (x(t;*params),y(t;*params)) where t is a physical (but unobservable) variable, and *params are the parameters that I want to determine. t is a continuous variable, and there is a 1:1 relationship between x and t and between y and t in the model.
In a perfect world, I would know the value of T (the real-world value of the parameter) and would be able to do an extremely basic least-squares fit to find the values of *params. (Note that I am not trying to "connect" the values of x and y in my plot, like in 31243002 or 31464345.) I cannot guarantee that in my real data, the latent value T is monotonic, as my data is collected across multiple cycles.
I'm not very experienced doing curve fitting manually, and have to use extremely crude methods without easy access to a basic scipy function. My basic approach involves:
Choose some value of *params and apply it to the model
Take an array of t values and put it into the model to create an array of model(*params) = (x(*params),y(*params))
Interpolate X (the data values) into model to get Y_predicted
Run a least-squares (or other) comparison between Y and Y_predicted
Do it again for a new set of *params
Eventually, choose the best values for *params
There are several obvious problems with this approach.
1) I'm not experienced enough with coding to develop a very good "do it again" other than "try everything in the solution space," or maybe "try everything in a coarse grid" and then "try everything again in a slightly finer grid in the hotspots of the coarse grid." I tried MCMC methods, but I never found any optimal values, largely because of problem 2.
2) Steps 2-4 are super inefficient in their own right.
I've tried something like the following (it resembles pseudo-code; the actual functions are made up). There are many minor quibbles that could be made about using broadcasting on A and B, but those are less significant than the problem of needing to interpolate at every single step.
People I know have recommended using some sort of Expectation Maximization algorithm, but I don't know enough about that to code one up from scratch. I'm really hoping there's some awesome scipy (or otherwise open-source) algorithm I haven't been able to find that covers my whole problem, but at this point I am not hopeful.
import numpy as np
from scipy import interpolate

# X_data, Y_data: the measured arrays (defined elsewhere)

def x(t, A, B):
    return A**t + B**t

def y(t, A, B):
    return A*t + B

def interp(A, B):
    ts = np.arange(-10, 10, 0.1)
    xs = x(ts, A, B)
    ys = y(ts, A, B)
    f = interpolate.interp1d(xs, ys)
    return f

N = 101
lsqs = np.zeros((N**2, 3))
count = 0
for i in range(0, N):
    A = 0.1*i  # checks A between 0 and 10
    for j in range(0, N):
        B = 10 + 0.1*j  # checks B between 10 and 20
        f = interp(A, B)
        y_fit = f(X_data)
        squares = np.sum((y_fit - Y_data)**2)
        lsqs[count] = (A, B, squares)  # puts the values in place for comparison later
        count += 1  # allows us to move to the next cell

i = np.argmin(lsqs[:, 2])
A_optimal = lsqs[i, 0]
B_optimal = lsqs[i, 1]
If I understand the question correctly, the params are constants which are the same in every sample, but t varies from sample to sample. So, for example, maybe you have a whole bunch of points which you believe have been sampled from a circle
x = a+r cos(t)
y = b+r sin(t)
at different values of t.
In this case, what I would do is eliminate the variable t to get a relation between x and y -- in this case, (x-a)^2+(y-b)^2 = r^2. If your data fit the model perfectly, you would have (x-a)^2+(y-b)^2 = r^2 at each of your data points. With some error, you could still find (a,b,r) to minimize
sum_i ((x_i-a)^2 + (y_i-b)^2 - r^2)^2.
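A sketch of that idea in Python, minimizing the suggested objective with scipy; the circle data here is synthetic, just to show the mechanics:

import numpy as np
from scipy.optimize import minimize

# Synthetic noisy circle: a = 1, b = 2, r = 3, sampled at arbitrary t values
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
X = 1 + 3 * np.cos(t) + 0.05 * rng.normal(size=t.size)
Y = 2 + 3 * np.sin(t) + 0.05 * rng.normal(size=t.size)

def objective(p):
    a, b, r = p
    # sum_i ((x_i - a)^2 + (y_i - b)^2 - r^2)^2
    return np.sum(((X - a)**2 + (Y - b)**2 - r**2)**2)

res = minimize(objective, x0=[0.0, 0.0, 1.0])
a_fit, b_fit, r_fit = res.x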
Mathematica's Eliminate command can automate the procedure of eliminating t in some cases.
PS You might do better at stats.stackexchange, math.stackexchange or mathoverflow.net . I know the last one has a scary reputation, but we don't bite, really!

Simple Automatic Classification of the (R-->R) Functions

Given data values of some real-time physical process (e.g. network traffic), the task is to find the name of the function which best matches the data.
I have a set of functions of type y=f(t) where y and t are real:
funcs = set([cos, tan, exp, log])
and a list of data values:
vals = [59874.141, 192754.791, 342413.392, 1102604.284, 3299017.372]
What is the simplest possible method to find the function from the given set that generates values most similar to these?
PS: t increases from some positive value in almost-equal intervals.
Just write down the error (the quadratic sum of the errors at each point, for instance) for each function in the set and choose the function giving the minimum.
But you should still fit each function before choosing.
Scipy has functions for fitting data, but they use polynomials or splines. You can use one of Gauß' many discoveries, the method of least squares, to fit other functions.
I would try an approach based on fitting too. For each of the four test functions (f1-f4, see below), find the values of a and b that minimize the squared error.
f1(t) = a*cos(b*t)
f2(t) = a*tan(b*t)
f3(t) = a*exp(b*t)
f4(t) = a*log(b*t)
After fitting, the squared errors of the four functions can be used to evaluate the goodness of fit (a low value means a good fit).
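A sketch of that fit-and-compare procedure with scipy's curve_fit; since the real t values aren't given, an evenly spaced t axis is assumed here purely for illustration:

import numpy as np
from scipy.optimize import curve_fit

vals = np.array([59874.141, 192754.791, 342413.392, 1102604.284, 3299017.372])
t = np.arange(1.0, 1.0 + len(vals))   # assumed t values (unknown in the question)

candidates = {
    'cos': lambda t, a, b: a * np.cos(b * t),
    'tan': lambda t, a, b: a * np.tan(b * t),
    'exp': lambda t, a, b: a * np.exp(b * t),
    'log': lambda t, a, b: a * np.log(b * t),
}

errors = {}
for name, f in candidates.items():
    try:
        popt, _ = curve_fit(f, t, vals, p0=[1.0, 1.0], maxfev=10000)
        errors[name] = np.sum((f(t, *popt) - vals)**2)   # squared error of the fit
    except (RuntimeError, ValueError):                   # fit failed for this candidate
        errors[name] = np.inf

best = min(errors, key=errors.get)   # function with the lowest squared error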
If fitting is not allowed at all, the four functions can be divided into two distinct subgroups: repeating functions (cos and tan) and strictly increasing functions (exp and log).
Strictly increasing functions can be identified by checking whether all the given values increase throughout the measuring interval.
In pseudo code, an algorithm could be structured like this:
if(vals are strictly increasing):
    % Exp or log
    if(increasing more rapidly at the end of the interval):
        % exp detected
    else:
        % log detected
else:
    % tan or cos
    if(large changes in vals over a short period are detected):
        % tan detected
    else:
        % cos detected
Be aware that this method is not that stable and is easy to trick into faulty conclusions.
See Curve Fitting
