I am attempting to fit a function to a set of data I have. The function in question is:
x(t) = - B + sqrt(AB(t-t0) + (x0 + B)^2)
I have tried to fit my data (included at the bottom) to this using two different methods but have found that whatever I do the fit for B is extremely unstable. Changing either the method or the initial guess wildly changes the output value. In addition, when I look at the error for this fit using curve_fit the error is almost two orders of magnitude higher than the value. Does anyone have some suggestions on what I should do to decrease the error?
import numpy as np
import scipy.optimize as spo
def modelFun(t, A, B):
    return -B + np.sqrt(A*B*(t - t0) + np.power(x0 + B, 2))

def errorFun(k, time, data):
    A = k[0]
    B = k[1]
    return np.sum((data - modelFun(time, A, B))**2)
data = np.genfromtxt('testdata.csv',delimiter=',',skip_header = 1)
time = data[:,0]
xt = data[:,1]
t0 = data[0,0]
x0 = data[0,1]
minErrOut = spo.minimize(errorFun,[1,1000],args=(time,xt),bounds=((0,None),(0,None)))
(curveOut, curveCovar) = spo.curve_fit(modelFun,time,xt,p0=[1,1000],method='dogbox',bounds=([-np.inf,0],np.inf))
print('minimize result: A={}; B={}'.format(*minErrOut.x))
print('curveFit result: A={}; B={}'.format(*curveOut))
print('curveFit Error: A={}; B={}'.format(*np.sqrt(np.diag(curveCovar))))
Datafile:
Time,x
201,2.67662
204,3.28159
206,3.44378
208,3.72537
210,3.94826
212,4.36716
214,4.65373
216,5.26766
219,5.59502
221,6
223,6.22189
225,6.49652
227,6.799
229,7.30846
231,7.54229
232,7.76517
233,7.6209
234,7.89552
235,7.94826
236,8.17015
237,8.66965
238,8.66965
239,8.8398
240,8.88856
241,9.00697
242,9.45075
243,9.51642
244,9.63483
245,9.63483
246,10.07861
247,10.02687
248,10.24876
249,10.31443
250,10.47164
251,10.99502
252,10.92935
253,11.0995
254,11.28358
255,11.58209
256,11.53035
257,11.62388
258,11.93632
259,11.98806
260,12.26269
261,12.43284
262,12.60299
263,12.801
264,12.99502
265,13.08557
266,13.25572
267,13.32139
268,13.57114
269,13.76617
270,13.88358
271,13.83184
272,14.10647
273,14.27662
274,14.40796
TL;DR
Your dataset is essentially linear and lacks observations at larger timescales. As a result you can capture A, which is proportional to the slope, while the model must keep B large (and potentially unbounded) to suppress the square-root curvature.
This can be confirmed by expanding your model as a Taylor series and analyzing the MSE surface associated with the regression.
In a nutshell: for this kind of dataset and the given model, accept A but don't trust B.
MCVE
First, let's reproduce your problem:
import io
import numpy as np
import pandas as pd
from scipy import optimize
stream = io.StringIO("""Time,x
201,2.67662
204,3.28159
...
273,14.27662
274,14.40796""")
data = pd.read_csv(stream)
# Origin Shift:
data = data.sub(data.iloc[0,:])
data = data.set_index("Time")
# Simplified model:
def model(t, A, B):
    return -B + np.sqrt(A*B*t + np.power(B, 2))
# NLLS Fit:
parameters, covariance = optimize.curve_fit(model, data.index, data["x"].values, p0=(1, 1000), ftol=1e-8)
# array([3.23405915e-01, 1.59960168e+05])
# array([[ 3.65068730e-07, -3.93410484e+01],
# [-3.93410484e+01, 9.77198860e+12]])
The resulting fit is fair:
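A minimal sketch of how the fit can be visualized (assuming matplotlib is available, and reusing the data, model and parameters objects defined above):
import matplotlib.pyplot as plt
# Overlay the fitted model on the shifted dataset:
tlin = np.linspace(data.index.min(), data.index.max(), 200)
fig, axe = plt.subplots()
axe.scatter(data.index, data["x"], marker=".", label="Data")
axe.plot(tlin, model(tlin, *parameters), label="Fit")
axe.legend()
axe.grid()
plt.show()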
But, as you noticed, the fitted parameters differ by several orders of magnitude, which can prevent the optimization from performing properly.
Notice that your dataset is quite linear. The observed effect is not surprising and is inherent to the chosen model: B must be several orders of magnitude bigger than A to preserve the linear behaviour.
This claim is supported by the analysis of the first terms of the Taylor series:
def taylor(t, A, B):
    return A/2*t - A**2/B*t**2/8
parameters, covariance = optimize.curve_fit(taylor, data.index, data["x"].values, p0=(1, 100), ftol=1e-8)
parameters
# array([3.23396685e-01, 1.05237134e+09])
Unsurprisingly, the slope of your linear dataset is captured, while B can grow arbitrarily large and causes floating-point errors during the optimization (hence the warning from minimize you got, shown further below).
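As a quick cross-check (my own addition, not part of the original analysis), the leading Taylor term predicts a slope of A/2, so a plain linear fit on the shifted data should recover roughly the same value:
# Compare the slope of a straight-line fit with A/2 from curve_fit:
slope, intercept = np.polyfit(data.index, data["x"].values, 1)
print(slope)            # slope of the raw (shifted) data
print(parameters[0]/2)  # A/2 from the fit above, expected to be close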
Analyzing Error Surface
The problem can be reformulated as a minimization problem:
def objective(beta, t, x):
    return np.sum(np.power(model(t, beta[0], beta[1]) - x, 2))
result = optimize.minimize(objective, (1, 100), args=(data.index, data["x"].values))
# fun: 0.6594398116927569
# hess_inv: array([[8.03062155e-06, 2.94644208e-04],
# [2.94644208e-04, 1.14979735e-02]])
# jac: array([2.07304955e-04, 6.40749931e-07])
# message: 'Desired error not necessarily achieved due to precision loss.'
# nfev: 389
# nit: 50
# njev: 126
# status: 2
# success: False
# x: array([3.24090627e-01, 2.11891188e+03])
If we plot the MSE associated with your dataset, we get the following surface:
We have a canyon that is narrow along the A axis but appears unbounded, at least over the first few decades, along the B axis. This supports the observations in your post and comments, and gives a technical insight into why B cannot be fitted properly.
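A minimal sketch of how such a surface can be generated (assuming matplotlib; the grid ranges below are my own illustrative choice):
import matplotlib.pyplot as plt
# Evaluate the objective on a grid of (A, B) values, with B on a log scale:
A_grid = np.linspace(0.1, 0.6, 100)
B_grid = np.logspace(0, 6, 100)
AA, BB = np.meshgrid(A_grid, B_grid)
SSE = np.array([[objective((a, b), data.index, data["x"].values) for a in A_grid]
                for b in B_grid])
fig, axe = plt.subplots()
c = axe.contourf(AA, BB, np.log10(SSE), levels=30)
axe.set_yscale("log")
axe.set_xlabel("A")
axe.set_ylabel("B")
fig.colorbar(c, label="log10(SSE)")
plt.show()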
Performing the same operation on a synthetic dataset:
t = np.linspace(0, 1000, 100)
x = model(t, 0.35, 20)
data = pd.DataFrame(x, index=t, columns=["x"])
This gives the square-root shape in addition to the linear trend at the beginning.
result = optimize.minimize(objective, (1, 0), args=(data.index, data["x"].values), tol=1e-8)
# fun: 1.9284246829733202e-10
# hess_inv: array([[ 4.34760333e-05, -4.42855253e-03],
# [-4.42855253e-03, 4.59219063e-01]])
# jac: array([ 4.35726463e-03, -2.19158602e-05])
# message: 'Desired error not necessarily achieved due to precision loss.'
# nfev: 402
# nit: 94
# njev: 130
# status: 2
# success: False
# x: array([ 0.34999987, 20.000013 ])
This version of the problem has the following MSE surface:
It shows a convex valley around the known solution, which explains why both parameters can be fitted when the acquisition covers sufficiently large times.
Notice the valley is strongly stretched, meaning that in this scenario the problem would benefit from normalization.
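One way such a normalization could be done is to rescale B inside the objective so both unknowns are of order one; a sketch under that assumption (the scale factor is my own illustrative choice, roughly the known B of the synthetic dataset):
scale = 20.0  # illustrative guess of B's magnitude
def objective_scaled(beta, t, x):
    # Optimize over (A, B/scale) so both parameters have comparable magnitude:
    return objective((beta[0], beta[1]*scale), t, x)
result = optimize.minimize(objective_scaled, (1.0, 1.0),
                           args=(data.index, data["x"].values), tol=1e-8)
print(result.x[0], result.x[1]*scale)  # map B back to the original scale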
What is the null hypothesis behind an OLSResults object's f_pvalue attribute? The docstring is not particularly useful.
At first I thought the null hypothesis was that all estimated coefficients are simultaneously zero (including the constant term). However, I am starting to think that the hypothesis being tested is that all estimated parameters except for the constant term are simultaneously zero (i.e. b1 = b2 = ... = bp = 0, excluding b0).
For example, suppose y is an array of targets and X is a numpy matrix of features (a constant term and p features).
# Silly example
from statsmodels.api import OLS
m = OLS(endog=y, exog=X).fit()
# What is being tested here?
print(m.f_pvalue)
Does anyone know what the null hypothesis is?
Thanks to @Josef for clearing things up. As per the documentation:
F-statistic of the fully specified model.
Calculated as the mean squared error of the model divided by the mean squared error of the residuals if the nonrobust covariance is used. Otherwise computed using a Wald-like quadratic form that tests whether all coefficients (excluding the constant) are zero.
And just to prove that this is the case:
# Libraries
import numpy as np
import pandas as pd
from statsmodels.api import OLS
from sklearn.datasets import load_boston
# Load target
y = pd.DataFrame(load_boston()['target'], columns=['price'])
# Load features
X = pd.DataFrame(load_boston()['data'], columns=load_boston()['feature_names'])
# Add constant
X['CONST'] = 1
# One feature
m1 = OLS(endog=y, exog=X[['CONST','CRIM']]).fit()
print(f'm1 pvalue: {m1.f_pvalue}')
# Multiple features
m2 = OLS(endog=y, exog=X[['CONST','CRIM','AGE']]).fit()
print(f'm2 pvalue: {m2.f_pvalue}')
# Manually test H0: all coefficients are zero (excluding b0)
print('Manual F-test for m1', m1.f_test(r_matrix=np.matrix([[0,0],[0,1]])),
      'Manual F-test for m2', m2.f_test(r_matrix=np.matrix([[0,0,0],[0,1,0],[0,0,1]])),
      sep='\n')
# Output
"""
> m1 pvalue: 1.1739870821944483e-19
> m2 pvalue: 2.2015246345918656e-27
> Manual F-test for m1
> <F test: F=array([[89.48611476]]), p=1.1739870821945733e-19, df_denom=504, df_num=1>
> Manual F-test for m2
> <F test: F=array([[69.51929476]]), p=2.2015246345920063e-27, df_denom=503, df_num=2>
"""
So yes, f_pvalue matches the p-value obtained by manually specifying that null hypothesis.
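As a further cross-check (my own addition, not part of the original answer), the same null hypothesis can be expressed as a comparison against an intercept-only model using compare_f_test, reusing y, X, m1 and m2 from above:
# Restricted model: intercept only
m0 = OLS(endog=y, exog=X[['CONST']]).fit()
f_value, p_value, df_diff = m1.compare_f_test(m0)
print(p_value)  # expected to match m1.f_pvalue
f_value, p_value, df_diff = m2.compare_f_test(m0)
print(p_value)  # expected to match m2.f_pvalue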
I'm trying to create some material for an introductory statistics seminar. The code below computes a 95% confidence interval for the mean, but the result does not match the one from the scipy implementation. Is there something wrong with my math or my code? Thanks.
EDIT:
Data was sampled from here
import pandas as pd
import numpy as np
x = np.random.normal(60000,15000,200)
income = pd.DataFrame()
income['Data Scientist'] = x
# Manual Implementation
sample_mean = income['Data Scientist'].mean()
sample_std = income['Data Scientist'].std()
standard_error = sample_std / (np.sqrt(income.shape[0]))
print('Mean',sample_mean)
print('Std',sample_std)
print('Standard Error',standard_error)
print('(',sample_mean-2*standard_error,',',sample_mean+2*standard_error,')')
# Python Library
import scipy.stats as st
se = st.sem(income['Data Scientist'])
a = st.t.interval(0.95, len(income['Data Scientist'])-1, loc=sample_mean, scale=se)
print(a)
print('Standard Error from this code block',se)
You've got two errors.
First, you are using 2 as the multiplier for the CI. The more accurate value is 1.96; "2" is just a convenient approximation. That is making your manually generated CI too wide.
Second, you are comparing a normal distribution to the t-distribution. This probably isn't causing more than decimal dust in the difference, because you have 199 degrees of freedom for the t-distribution, which is essentially the normal.
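A quick check of that claim (my own addition, not from the original answer) is to compare the exact critical values directly:
import scipy.stats as st
# t critical value with 199 degrees of freedom vs. the normal critical value:
print(st.t.ppf(0.975, df=199))  # roughly 1.97
print(st.norm.ppf(0.975))       # roughly 1.96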
Below is the CDF value at z = 1.96 (showing where that multiplier comes from) and the CI computed from the normal distribution, for an apples-to-apples comparison with the t-based interval.
In [45]: st.norm.cdf(1.96)
Out[45]: 0.9750021048517795
In [46]: print('(',sample_mean-1.96*standard_error,',',sample_mean+1.96*standard_error,')')
( 57558.007862202685 , 61510.37559873406 )
In [47]: st.norm.interval(0.95, loc=sample_mean, scale=se)
Out[47]: (57558.044175045005, 61510.33928589174)
I am trying to subtract every element in the column from its mean and divide by the standard deviation. I did it in two different ways (numeric_data1 and numeric_data2):
import numpy as np
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
numeric_data = data.drop("color", 1)
numeric_data1 = ((numeric_data - numeric_data.mean()) /
numeric_data.std())
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
np.std(numeric_data, axis=0))
type(numeric_data1) # -> pandas.core.frame.DataFrame
type(numeric_data2) # -> pandas.core.frame.DataFrame
Both are pandas dataframes and they should have the same result. However, I get different results:
numeric_data2 == numeric_data1 # -> False
I think the problem stems from how numpy and pandas handle numeric precision:
numeric_data.mean() == np.mean(numeric_data, axis=0) # -> True
numeric_data.std(axis=0) == np.std(numeric_data, axis=0) # -> False
For the mean, numpy and pandas give me the same thing, but for the standard deviation I get slightly different results.
Is my assessment correct or am I making some blunder?
When calculating the standard deviation, it matters whether you are estimating the standard deviation of an entire population from a smaller sample of that population, or calculating the standard deviation of the entire population itself.
If it is a smaller sample of a larger population, you need what is called the sample standard deviation. As it turns out, when you divide the sum of squared differences from the mean by the number of observations, you end up with a biased estimator. We correct for that by dividing by one less than the number of observations. We control for this with the argument ddof=1 for sample standard deviation or ddof=0 for population standard deviation.
Truth is, it doesn't matter much if your sample size is large. But you will see small differences.
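A small illustration of the difference (my own example, not from the original answer):
import numpy as np
import pandas as pd
x = pd.Series([1.0, 2.0, 3.0, 4.0])
print(x.std())            # ~1.291, pandas defaults to ddof=1 (sample)
print(np.std(x))          # ~1.118, numpy defaults to ddof=0 (population)
print(np.std(x, ddof=1))  # ~1.291, matches the pandas default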
Use the degrees of freedom argument in your pandas.DataFrame.std call:
import numpy as np
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
numeric_data = data.drop("color", 1)
numeric_data1 = ((numeric_data - numeric_data.mean()) /
numeric_data.std(ddof=0)) # <<<
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
np.std(numeric_data, axis=0))
np.isclose(numeric_data1, numeric_data2).all() # -> True
Or in the np.std call:
import numpy as np
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
numeric_data = data.drop("color", 1)
numeric_data1 = ((numeric_data - numeric_data.mean()) /
numeric_data.std())
numeric_data2 = ((numeric_data - np.mean(numeric_data, axis=0)) /
np.std(numeric_data, axis=0, ddof=1)) # <<<
np.isclose(numeric_data1, numeric_data2).all() # -> True
I am attempting to generate a random probability density function of QSO's of certain luminosity with the form:
1/( (L/L_B^* )^alpha + (L/L_B^* )^beta )
where L_B^*, alpha, and beta are all constants. To do this, the following code is used:
import scipy.stats as st
logLbreak = 43.88
alpha = 3.4
beta = 1.6
class my_pdf(st.rv_continuous):
    def _pdf(self, l_L):
        # "l_L" in this is always log L
        L = 10**(l_L/logLbreak)
        D = 1/(L**alpha + L**beta)
        return D
dist_Log_L = my_pdf(momtype = 0, a = 0,name='l_L_dist')
distro = dist_Log_L.rvs(size = 10000)
(L/L^* is raised to a power of 10 since everything is done on a log scale.)
The distribution is supposed to produce a graph that trails off towards infinity, but in reality the graph it produces is cut off well short of that (10,000 samples). The upper bound is the same regardless of the number of samples used. Is there a reason it is being restricted in this way?
Your PDF is not properly normalized. The integral of a PDF over the domain must be 1. Your PDF integrates to approximately 3.4712:
In [72]: from scipy.integrate import quad
In [73]: quad(dist_Log_L._pdf, 0, 100)
Out[73]: (3.4712183965415373, 2.0134487716044682e-11)
In [74]: quad(dist_Log_L._pdf, 0, 800)
Out[74]: (3.4712184965748905, 2.013626296581202e-11)
In [75]: quad(dist_Log_L._pdf, 0, 1000)
Out[75]: (3.47121849657489, 8.412130378805368e-10)
This will break the class's implementation of inverse transform sampling. It will only generate samples from the domain up to the point where the integral of the PDF from 0 to x first reaches 1.0, which in your case is about 2.325:
In [81]: quad(dist_Log_L._pdf, 0, 2.325)
Out[81]: (1.0000875374350238, 1.1103202107010366e-14)
That is, in fact, what you see in your histogram.
As a quick fix to verify the issue, I modified the return statement of the _pdf() method to:
return D/3.47121849657489
and ran your script again. (In a real fix, that value will be a function of the other parameters.) Then the commands
In [85]: import matplotlib.pyplot as plt
In [86]: plt.hist(distro, bins=31)
generate this plot:
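In a more general fix, the normalization constant would be computed from the parameters rather than hard-coded. A sketch of how that might look (my own illustration, using scipy.integrate.quad and the constants from the question):
import numpy as np
import scipy.stats as st
from scipy.integrate import quad

logLbreak = 43.88
alpha = 3.4
beta = 1.6

def unnormalized(l_L):
    # Same expression as in the question's _pdf
    L = 10**(l_L/logLbreak)
    return 1/(L**alpha + L**beta)

# Normalization constant: integral of the unnormalized PDF over the support [0, inf)
norm_const, _ = quad(unnormalized, 0, np.inf)

class my_pdf(st.rv_continuous):
    def _pdf(self, l_L):
        return unnormalized(l_L)/norm_const

dist_Log_L = my_pdf(momtype=0, a=0, name='l_L_dist')
distro = dist_Log_L.rvs(size=10000)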