Why is minimize_scalar not minimizing correctly? - python

I am a new Python user, so bear with me if this question is obvious.
I am trying to find the value of lmbda that minimizes the following function, given a fixed vector Z and scalar sigma:
def sure_sft(z,lmbda, sigma):
indicator = np.abs(z) <= lmbda;
minimum = np.minimum(z**2,lmbda**2);
return -sigma**2*np.sum(indicator) + np.sum(minimum);
When I pass in values of lmbda manually, I find that the function produces the correct value of sure_stf. However, when I try to use the following code to find the value of lmbda that minimizes sure_stf:
minimize_scalar(lambda lmbda: sure_sft(Z, lmbda, sigma))
it gives me an incorrect value for sure_stf (-8.6731 for lmbda = 0.4916). If I pass in 0.4916 manually to sure_sft, I obtain -7.99809 instead. What am I doing incorrectly? I would appreciate any advice!
EDIT: I've pasted my code below. The data is from: https://web.stanford.edu/~chadj/HallJones400.asc
import pandas as pd
import numpy as np
from scipy.optimize import minimize_scalar
# FUNCTIONS
# Calculate orthogonal projection of g onto f
def proj(f, g):
return ( np.dot(f,g) / np.dot(f,f) ) * f
def gs(X):
# Copy of X -- will be used to store orthogonalization
F = np.copy(X)
# Orthogonalize design matrix
for i in range(1, X.shape[1]): # Iterate over columns of X
for j in range(i): # Iterate over columns less than current one
F[:,i] -= proj(F[:,j], X[:,i]) # Subtract projection of x_i onto f_j for all j<i from F_i
# normalize each column to have unit length
norm_F=( (F**2).mean(axis=0) ) ** 0.5 # Row vector with sqrt root of average of the squares of each column
W = F/norm_F # Normalize
return W
# SURE for soft-thresholding
def sure_sft(z,lmbda, sigma):
indicator = np.abs(z) <= lmbda
minimum = np.minimum(z**2,lmbda**2)
return -sigma**2*np.sum(indicator) + np.sum(minimum)
# Import data.
data_raw = pd.read_csv("hall_jones1999.csv")
# Drop missing observations.
data = data_raw.dropna(subset=['logYL', 'Latitude'])
Y = data['logYL']
Y = np.array(Y)
N = Y.size
# Create design matrix.
design = np.empty([data['Latitude'].size,15])
design[:,0] = 1
for j in range(1, 15):
design[:,j] = data['Latitude']**j
K = design.shape[1]
# Use Gramm-Schmidt on design matrix.
W = gs(design)
Z = np.dot(W.T, Y)/N
# MLE
mu_mle = np.dot(W, Z)
# Soft-thresholding
# Use MLE residuals to calculate sigma for SURE calculation
sigma = np.sqrt(np.sum((Y - mu_mle)**2)/(N-K))
# Write SURE as a function of lmbda
sure = lambda lmbda: sure_sft(Z, lmbda, sigma)
# Find SURE-minimizing lmbda
lmbda = minimize_scalar(sure).x
min_sure = minimize_scalar(sure).fun #-8.673172212265738
# Compare to manually inputting minimized lambda into sure_sft
# I'm s
act_sure1 = sure_sft(Z, 0.49167598, sigma) #-7.998060514873529
act_sure2 = sure_sft(Z, 0.491675989, sigma) #-8.673172212306728

You're actually not doing anything wrong. I just tested out the code and confirmed that lmbda has a value of 0.4916759890416824 at the end of the script. You can confirm this for yourself by adding the following lines to the bottom of your script:
print(lmbda)
print(sure_sft(Z, lmbda, sigma))
when you run your script you should then see:
0.4916759890416824
-8.673158394698172
The only thing I can figure is that somehow the routine you were using to print out lmbda was set up to only print a fixed number of digits of floating point numbers, or somehow the printout was otherwise truncated.

Related

Weird results obtained while solving a set of coupled differential equations (using a sparse array) in python

I have tried to no avail for a week while trying to solve a system of coupled differential equations and reproduce the results shown in the attached image. I seem to be getting weird results as shown also. I don't seem to know what I might be doing wrong.The set of coupled differential equations were solved using Newman's BAND. Here's a link to the python implementation: python solution using BAND . And another link to the original image of the problem in case the attached is not clear enough: here you find a clearer image of the problem. Now what I am trying to do is to solve the same problem by creating a sparse array directly from the discretized equations using a combination of sympy and numpy and then solving using scipy's spsolve. Here is my code below. I need some help to figure out what I am doing wrong.
I have represented the variables as c1 = cA, c2 = cB, c3 = cC, c4 = cD in my code. Equation 2 has been linearized and phi10 and phi20 are the trial values of the variables cC and cD.
# import modules
import numpy as np
import sympy
from sympy.core.function import _mexpand
import scipy as sp
import scipy.sparse as ss
import scipy.sparse.linalg as ssl
import matplotlib.pyplot as plt
# define functions
def flatten(t):
"""
function to flatten lists
"""
return [item for sublist in t for item in sublist]
def get_coeffs(coeff_dict, func_vars):
"""
function to extract coefficients from variables
and form the sparse symbolic array
"""
c = coeff_dict
for i in list(c.keys()):
b, _ = i.as_base_exp()
if b == i:
continue
if b in c:
c[i] = 0
if any(k.has(b) for k in c):
c[i] = 0
return [coeff_dict[val] for val in func_vars]
# Constants for the problem
I = 0.1 # A/cm2
L = 1.0 # distance (x) in cm
m = 100 # grid spacing
h = L / (m-1)
a = 23300 # 1/cm
io = 2e-7 # A/cm2
n = 1
F = 96500 # C/mol
R = 8.314 # J/mol-K
T = 298 # K
sigma = 20 # S/cm
kappa = 0.06 # S/cm
alpha = 0.5
beta = -(1-alpha)*n*F/R/T
phi10 , phi20 = 5, 0.5 # these are just guesses
P = a*io*np.exp(beta*(phi10-phi20))
j = sympy.symbols('j',integer = True)
cA = sympy.IndexedBase('cA')
cB = sympy.IndexedBase('cB')
cC = sympy.IndexedBase('cC')
cD = sympy.IndexedBase('cD')
# write the boundary conditions at x = 0
bc=[cA[1], cB[1],
(4/3) * cC[2] - (1/3)*cC[3], # use a three point approximation for cC_prime
cD[1]
]
# form a list of expressions from the boundary conditions and equations
expr=flatten([bc,flatten([[
-cA[j-1] - cB[j-1] + cA[j+1] + cB[j+1],
cB[j-1] - 2*h*P*beta*cC[j] + 2*h*P*beta*cD[j] - cB[j+1],
-sigma*cC[j-1] + 2*h*cA[j] + sigma * cC[j+1],
-kappa * cD[j-1] + 2*h * cB[j] + kappa * cD[j+1]] for j in range(2, m)])])
vars = [cA[j], cB[j], cC[j], cD[j]]
# flatten the list of variables
unknowns = flatten([[cA[j], cB[j], cC[j], cD[j]] for j in range(1,m)])
var_len = len(unknowns)
# # # substitute in the boundary conditions at x = L while getting the coefficients
A = sympy.SparseMatrix([get_coeffs(_mexpand(i.subs({cA[m]:I}))\
.as_coefficients_dict(), unknowns) for i in expr])
# convert to a numpy array
mat_temp = np.array(A).astype(np.float64)
# you can view the sparse array with this
fig = plt.figure(figsize=(6,6))
ax = fig.add_axes([0,0, 1,1])
cmap = plt.cm.binary
plt.spy(mat_temp, cmap = cmap, alpha = 0.8)
def solve_sparse(b0, error):
# create the b column vector
b = np.copy(b0)
b[0:4] = np.array([0.0, I, 0.0, 0.0])
b[var_len-4] = I
b[var_len-3] = 0
b[var_len-2] = 0
b[var_len-1] = 0
print(b.shape)
old = np.copy(b0)
mat = np.copy(mat_temp)
b_2 = np.copy(b)
resid = 10
lss = 0
while lss < 100:
mat_2 = np.copy(mat)
for j in range(3, var_len - 3, 4):
# update the forcing term of equation 2
b_2[j+2] = 2*h*(1-beta*old[j+3]+beta*old[j+4])*a*io*np.exp(beta*(old[j+3]-old[j+4]))
# update the sparse array at every iteration for variables cC and cD in equation2
mat_2[j+2, j+3] += 2*h*beta*a*io*np.exp(beta*(old[j+3]-old[j+4]))
mat_2[j+2, j+4] += 2*h*beta*a*io*np.exp(beta*(old[j+3]-old[j+4]))
# form the column sparse matrix
A_s = ss.csc_matrix(mat_2)
new = ssl.spsolve(A_s, b_2).flatten()
resid = np.sum((new - old)**2)/var_len
lss += 1
old = np.copy(new)
return new
val0 = np.array([[0.0, 0.0, 0.0, 0.0] for _ in range(m-1)]).flatten() # form an array of initial values
error = 1e-7
## Run the code
conc = solve_sparse(val0, error).reshape(m-1, len(vars))
conc.shape # gives (99, 4)
# Plot result for cA:
plt.plot(conc[:,0], marker = 'o', linestyle = '')
What happens seems pretty clear now, after having seen that the plotted solution indeed oscillates between the upper and lower values. You are using the central Euler method as discretization, for u'=F(u) this reads as
u[j+1]-u[j-1] = 2*h*F(u[j])
This method is only weakly stable and allows the sub-sequences of odd and even indices to evolve rather independently. As equation this would mean that the solution might approximate the system ue'=F(uo), uo'=F(ue) with independent functions ue, uo that follow the path of the even or odd sub-sequence.
These even and odd parts are only tied together by the treatment of the boundary points, two or three points deep. So to avoid or reduce the oscillation requires a very careful handling of boundary conditions and also the differential equations for the boundary points.
But one can avoid all this unpleasantness by using the trapezoidal method
u[j+1]-u[j] = 0.5*h*(F(u[j+1])+F(u[j]))
This also reduces the band-width of the system matrix.
To properly implement the implied Newton method correctly (linearizing via Taylor and solving the linearized equation is what the Newton-Kantorovich method does) you need to replace F(u[j]) with F(u_old[j])+F'(u_old[j])*(u[j]-u_old[j]). This then gives a linear system of equations in u for the iteration step.
For the trapezoidal method this gives
(I-0.5*h*F'(u_old[j+1]))*u[j+1] - (I+0.5*h*F'(u_old[j]))*u[j]
= 0.5*h*(F(u_old[j+1])-F'(u_old[j+1])*u_old[j+1] + F(u_old[j])-F'(u_old[j])*u_old[j])
In general, the derivatives values and thus the system matrix need not be updated every step, only the function value (else the iteration does not move forward).

Euler's method for Python

I want to approximate the solutions of dy/dx = -x +1, with eulers method on the interval from 0 to 2. I'm using this code
def f(x):
return -x+1 # insert any function here
x0 = 1 # Initial slope #
dt = 0.1 # time step
T = 2 # ...from...to T
t = np.linspace(0, T, int(T/dt) + 1) # divide the interval from 0 to 2 by dt
x = np.zeros(len(t))
x[0] = x0 # value at 1 is the initial slope
for i in range(1, len(t)): # apply euler method
x[i] = x[i-1] + f(x[i-1])*dt
plt.figure() # plot the result
plt.plot(t,x,color='blue')
plt.xlabel('t')
plt.ylabel('y(t)')
plt.show()
Can I use this code to approximate the solutions of any function on any interval? It's hard to see whether this actually works, because I don't know how to plot the actual solution ( -1/2x^2 + x ) along side the approximation.
It would probably help if you consistently used the same variable names for the same role. Per your output, the solution is y(t). Thus your differential equation should be dy(t)/dt = f(t,y(t)). This would then give an implementation for the slope function and its exact solution
def f(t,y): return 1-t
def exact_y(t,t0,y0): return y0+0.5*(1-t0)**2-0.5*(1-t)**2
Then implement the Euler loop also as a separate function, keeping out problem specific details as much as possible
def Eulerint(f,t0,y0,te,dt):
t = np.arange(t0,te+dt,dt)
y = np.zeros(len(t))
y[0] = y0
for i in range(1, len(t)): # apply euler method
y[i] = y[i-1] + f(t[i-1],y[i-1])*dt
return t,y
Then plot the solutions as
y0,T,dt = 1,2,0.1
t,y = Eulerint(f,0,y0,T,dt)
plt.plot(t,y,color='blue')
plt.plot(t,exact_y(t,0,y0),color='orange')
You can just plot the actual solution by using:
def F(x):
return -0.5*x+x
# some code lines
plt.plot(t,x,color='blue')
plt.plot(t,F(t),color='orange')
But please note that the actual solution (-1/2x+x = 1/2x) does not correspond to your slope f(x) and will show a different solution.
The *real slope f(x) of the actual solution (-1/2x+x = 1/2x) is just simply f(x)=1/2

RMS value of a function

Now the full code / questions
I would like to estimate the random fluctuations of the function v - therefore I would like to calculate the RMS value of it:
import numpy as np
import matplotlib.pyplot as plt
def HHmodel(I,length, area):
v = []
m = []
h = []
z = []
n = []
squares = []
vsquare = (-60)*(-60)
sumsquares = 0
rms = []
a= []
dt = 0.05
t = np.linspace(0,100,length)
#constants
Cm = area#microFarad
ENa=50 #miliVolt
EK=-77 #miliVolt
El=-54 #miliVolt
g_Na=120*area #mScm-2
g_K=36*area #mScm-2
g_l=0.03*area #mScm-2
def alphaN(v):
return 0.01*(v+50)/(1-np.exp(-(v+50)/10))
def betaN(v):
return 0.125*np.exp(-(v+60)/80)
def alphaM(v):
return 0.1*(v+35)/(1-np.exp(-(v+35)/10))
def betaM(v):
return 4.0*np.exp(-0.0556*(v+60))
def alphaH(v):
return 0.07*np.exp(-0.05*(v+60))
def betaH(v):
return 1/(1+np.exp(-(0.1)*(v+30)))
#Initialize the voltage and the channels :
v.append(-60)
rms.append(1)
m0 = alphaM(v[0])/(alphaM(v[0])+betaM(v[0]))
n0 = alphaN(v[0])/(alphaN(v[0])+betaN(v[0]))
h0 = alphaH(v[0])/(alphaH(v[0])+betaH(v[0]))
#t.append(0)
m.append(m0)
n.append(n0)
h.append(h0)
#solving ODE using Euler's method:
for i in range(1,len(t)):
m.append(m[i-1] + dt*((alphaM(v[i-1])*(1-m[i-1]))-betaM(v[i-1])*m[i-1]))
n.append(n[i-1] + dt*((alphaN(v[i-1])*(1-n[i-1]))-betaN(v[i-1])*n[i-1]))
h.append(h[i-1] + dt*((alphaH(v[i-1])*(1-h[i-1]))-betaH(v[i-1])*h[i-1]))
gNa = g_Na * h[i-1]*(m[i-1])**3
gK=g_K*n[i-1]**4
gl=g_l
INa = gNa*(v[i-1]-ENa)
IK = gK*(v[i-1]-EK)
Il=gl*(v[i-1]-El)
v.append(v[i-1]+(dt)*((1/Cm)*(I[i-1]-(INa+IK+Il))))
#v.append(v[i-1]+(dt)*((1/Cm)*(I-(INa+IK+Il))))
meansquare = np.sqrt((np.square(v).sum()))
return v,area,meansquare
spikeEvents = [] #timing each spike
length = 1000*5 #the time period
fluctuations = []
output = []
for j in range(1, 10):
barcode = np.zeros(length)
noisyI = np.random.normal(0,9,length)
area = 1.0+0.1*j
res = HHmodel(noisyI,length,area)
output.append(res[2])
print('Done.')
The goal should be that the fluctuations of v increase in some way with the size of the are a - I was thinking here of the rms amplitude as a reasonable measure
BR
edit:
for i in range(1,len(t)):
m.append(m[i-1] + dt*((alphaM(v[i-1])*(1-m[i-1]))-betaM(v[i-1])*m[i-1]))
n.append(n[i-1] + dt*((alphaN(v[i-1])*(1-n[i-1]))-betaN(v[i-1])*n[i-1]))
h.append(h[i-1] + dt*((alphaH(v[i-1])*(1-h[i-1]))-betaH(v[i-1])*h[i-1]))
gNa = g_Na * h[i-1]*(m[i-1])**3
gK=g_K*n[i-1]**4
gl=g_l
INa = gNa*(v[i-1]-ENa)
IK = gK*(v[i-1]-EK)
Il=gl*(v[i-1]-El)
v.append(v[i-1]+(dt)*((1/Cm)*(I[i-1]-(INa+IK+Il))))
z.append(v[i-1]-np.mean(v))
#v.append(v[i-1]+(dt)*((1/Cm)*(I-(INa+IK+Il))))
mean = sum(np.square(v))/len(v)
squared_diffs =[(item-mean)**2 for item in v]
ms_diff = sum(squared_diffs)/len(squared_diffs)
rms_diff =np.sqrt(ms_diff)
return v,area,rms_diff
edit2:
Plot for j in range(1,10) - blue: rmsvalue as calculated in edit 1, yellow 1/sqrt(j)
edit3:
Plot for j in range(1,100) - but the "size" of fluctuations should increase, and not decrease and center somewhere
A few minor notes:
So, basically your "function" v is a one-timestep discrete evaluation of some function rather than a true function, but that's not really relevant here.
As indicated by comments above, you should calculate v for all timesteps and aggregate the squared values, then sum them outside of the loop and normalize by dividing by len(v).
It is also unclear why in iteration i you calculate v[i] but the corresponding squared value you calculate is v[i-1] squared. Should use same index on same loop iteration or you'll likely end up missing an element.
I would say that the reason that the result is not useful is that root-mean square is not really ever used for a function's outputs (RMS in this case is just some sort of less useful mean that gives extra weight to outliers); rather RMS is generally used on the error or variance of that function's outputs. RMS error or variance tells you how far, in the function's original units, does the average function value differ from the average value?). Note that this is only really an imporant metric if you expect the value of v to be constant.
Given all this, it's hard to say from your question what your intention is and what you're actually trying to do with this info so I will guess that what you really care about is how much the value of v is varying from the mean. In this case, you can use RMS difference from mean value of v calculated as such:
for i in range(1,len(t)):
#calculate v[i] here, omitted for simplicity
# get mean value
mean = sum(squares)/len(squares)
# you want to get the squared value of the difference, not the value itself
squared_diffs = [(item - mean)**2 for item in v)]
# get mean squared diff
ms_diff = sum(squared_diffs) / len(squared_diffs)
# return root of mean squared diff
rms_diff = np.sqrt(ms_diff)
return v,area,rms_diff
Again, this is only useful if you expect the outputs of v to be a constant. If not, you would try to fit a different model (linear, quadratic, etc.) to the function and then calculate the RMS error. Question would be much clearer if you indicated goal of this calculation.

Fourier's fit coefficients

lately i am been working fitting a fourier series function to a periodic signal for retrieve the amplitude and the phase of each component via least squares, so i modified the code of this file for it:
import math
import numpy as np
#period of the signal
per=1.0
w = 2.0*np.pi/per
#number of fourier components.
nf = 5
fp = open("file.cat","r")
# m1 is the number of unknown coefficients.
m1 = 2*nf + 1
# Create empty matrices.
x = np.zeros((m1,m1))
y = np.zeros((m1,1))
xi = [0.0]*m1
# Read (time, value) from each line of the file.
for line in fp:
t = float(line.split()[0])
yi = float(line.split()[1])
xi[0] = 1.0
for k in range(1,nf+1):
xi[2*k-1] = np.sin(k*w*t)
xi[2*k] = np.cos(k*w*t)
for j in range(m1):
for k in range(m1):
x[j,k] += xi[j]*xi[k]
y[j] += yi*xi[j]
fp.close()
# Copy to big matrices.
X = np.mat( x.copy() )
Y = np.mat( y.copy() )
# Invert X and multiply by Y to get coefficients.
A = X.I*Y
A0 = A[0]
# Solution is A0 + Sum[ Amp*sin(k*wt + phi) ]
print "a[0] = %f" % A[0]
for k in range(1,nf+1):
amp = math.sqrt(A[2*k-1]**2 + A[2*k]**2)
phs = math.atan2(A[2*k],A[2*k-1])
print "amp[%d] = %f phi = %f" % (k, amp, phs)
but the plot show this (without the points, of course):
and it should show something like this:
somebody can tell me how can i compute the phase and the amplitude in another simpler way? a guide maybe, i will be very grateful.
cheers!
PD. I will attach the FILE that i used, just because :)
EDITED
The error was with a index :(
First, I defined the vector with the values:
amp = np.array([np.sqrt((A[2*k-1])**2 + (A[2*k])**2) for k in range(1,nf+1)])
phs = np.array([math.atan2(A[2*k],A[2*k-1]) for k in range(1,nf+1)])
and then, to build the signal, I defined:
def term(t): return np.array([amp[k]*np.sin(k*w*t + phs[k]) for k in range(len(amp))])
Signal = np.array([A0+sum(term(phase[i])) for i in range(len(mag))])
but within the np.sin(), k should be k+1, because the index start in 0 ·__·
def term(t): return np.array([amp[k]*np.sin((k+1)*w*t + phs[k]) for k in range(len(amp))])
plt.plot(phase,Signal,'r-',lw=3)
and that is all.
Thanks Marco Tompitak for the help!!
You're specifying the wrong period for the signal:
#period of the signal
per=0.178556
This gives you the resulting Fourier fit, indeed with a maximum period of ~0.17. The problem is that this number specifies the longest period that is present in your Fourier series. The function only has components with perior 0.17 or shorter. Apparently you are expecting a fit with period ~1, so it can never approximate that properly. You should specify per=1.0. There's nothing wrong with the algorithm; a quick writeup of a similar algorithm in Mathematica gives the same output and plausible results:

How do you compute the confidence interval for Pearson's r in Python?

In Python, I know how to calculate r and associated p-value using scipy.stats.pearsonr, but I'm unable to find a way to calculate the confidence interval of r. How is this done? Thanks for any help :)
According to [1], calculation of confidence interval directly with Pearson r is complicated due to the fact that it is not normally distributed. The following steps are needed:
Convert r to z',
Calculate the z' confidence interval. The sampling distribution of z' is approximately normally distributed and has standard error of 1/sqrt(n-3).
Convert the confidence interval back to r.
Here are some sample codes:
def r_to_z(r):
return math.log((1 + r) / (1 - r)) / 2.0
def z_to_r(z):
e = math.exp(2 * z)
return((e - 1) / (e + 1))
def r_confidence_interval(r, alpha, n):
z = r_to_z(r)
se = 1.0 / math.sqrt(n - 3)
z_crit = stats.norm.ppf(1 - alpha/2) # 2-tailed z critical value
lo = z - z_crit * se
hi = z + z_crit * se
# Return a sequence
return (z_to_r(lo), z_to_r(hi))
Reference:
http://onlinestatbook.com/2/estimation/correlation_ci.html
Using rpy2 and the psychometric library (you will need R installed and to run install.packages("psychometric") within R first)
from rpy2.robjects.packages import importr
psychometric=importr('psychometric')
psychometric.CIr(r=.9, n = 100, level = .95)
Where 0.9 is your correlation, n the sample size and 0.95 the confidence level
Here's a solution that uses bootstrapping to compute the confidence interval, rather than the Fisher transformation (which assumes bivariate normality, etc.), borrowing from this answer:
import numpy as np
def pearsonr_ci(x, y, ci=95, n_boots=10000):
x = np.asarray(x)
y = np.asarray(y)
# (n_boots, n_observations) paired arrays
rand_ixs = np.random.randint(0, x.shape[0], size=(n_boots, x.shape[0]))
x_boots = x[rand_ixs]
y_boots = y[rand_ixs]
# differences from mean
x_mdiffs = x_boots - x_boots.mean(axis=1)[:, None]
y_mdiffs = y_boots - y_boots.mean(axis=1)[:, None]
# sums of squares
x_ss = np.einsum('ij, ij -> i', x_mdiffs, x_mdiffs)
y_ss = np.einsum('ij, ij -> i', y_mdiffs, y_mdiffs)
# pearson correlations
r_boots = np.einsum('ij, ij -> i', x_mdiffs, y_mdiffs) / np.sqrt(x_ss * y_ss)
# upper and lower bounds for confidence interval
ci_low = np.percentile(r_boots, (100 - ci) / 2)
ci_high = np.percentile(r_boots, (ci + 100) / 2)
return ci_low, ci_high
Answer given by bennylp is mostly correct, however, there is a small error in calculating the critical value in the 3rd function.
It should instead be:
def r_confidence_interval(r, alpha, n):
z = r_to_z(r)
se = 1.0 / math.sqrt(n - 3)
z_crit = stats.norm.ppf((1 + alpha)/2) # 2-tailed z critical value
lo = z - z_crit * se
hi = z + z_crit * se
# Return a sequence
return (z_to_r(lo), z_to_r(hi))
Here's another post for reference: Scipy - two tail ppf function for a z value?
I know bootstrapping has been suggested above, proposing another variation of it below, which may suit some other set ups better.
#1
Sample your data (paired X & Ys and can also add other say weight) , fit original model on it, record r2, append it. Then extract your confidence intervals from your distribution of all R2s recorded.
#2 Additionally can fit on sampled data and using sampled data model predict on non sampled X (could also supply a continuous range to extend your predictions instead of using original X)
to get confidence intervals on your Y hats.
So in sample code:
import numpy as np
from scipy.optimize import curve_fit
import pandas as pd
from sklearn.metrics import r2_score
x = np.array([your numbers here])
y = np.array([your numbers here])
### define list for R2 values
r2s = []
### define dataframe to append your bootstrapped fits for Y hat ranges
ci_df = pd.DataFrame({'x': x})
### define how many samples you want
how_many_straps = 5000
### define your fit function/s
def func_exponential(x,a,b):
return np.exp(b) * np.exp(a * x)
### fit original, using log because fitting exponential
polyfit_original = np.polyfit(x
,np.log(y)
,1
,# w= could supply weight for observations here)
)
for i in range(how_many_straps+1):
### zip into tuples attaching X to Y, can combine more variables as well
zipped_for_boot = pd.Series(tuple(zip(x,y)))
### sample zipped X & Y pairs above with replacement
zipped_resampled = zipped_for_boot.sample(frac=1,
replace=True)
### creater your sampled X & Y
boot_x = []
boot_y = []
for sample in zipped_resampled:
boot_x.append(sample[0])
boot_y.append(sample[1])
### predict sampled using original fit
y_hat_boot_via_original_fit = func_exponential(np.asarray(boot_x),
polyfit_original[0],
polyfit_original[1])
### calculate r2 and append
r2s.append(r2_score(boot_y, y_hat_boot_via_original_fit))
### fit sampled
polyfit_boot = np.polyfit(boot_x
,np.log(boot_y)
,1
,# w= could supply weight for observations here)
)
### predict original via sampled fit or on a range of min(x) to Z
y_hat_original_via_sampled_fit = func_exponential(x,
polyfit_boot[0],
polyfit_boot[1])
### insert y hat into dataframe for calculating y hat confidence intervals
ci_df["trial_" + str(i)] = y_hat_original_via_sampled_fit
### R2 conf interval
low = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[0],3)
up = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[1],3)
F"r2 confidence interval = {low} - {up}"

Categories