computing an integral using an empirical integrand - python

I have an empirical probability function p(z). The first column contains z and the second column contains the corresponding p(z) values. The data are given as follows:
data.cat
+0.01234 +0.002816
+0.03693 +0.003265
+0.06152 +0.003551
+0.08611 +0.006612
+0.1107 +0.008898
+0.1353 +0.01041
+0.1599 +0.01269
+0.1845 +0.01404
+0.2091 +0.01616
+0.2336 +0.01657
+0.2582 +0.01865
+0.2828 +0.01951
+0.3074 +0.02024
+0.332 +0.02131
+0.3566 +0.0222
+0.3812 +0.02306
+0.4058 +0.02241
+0.4304 +0.02347
+0.4549 +0.02461
+0.4795 +0.02306
+0.5041 +0.02298
+0.5287 +0.02212
+0.5533 +0.02392
+0.5779 +0.02118
+0.6025 +0.02359
+0.6271 +0.024
+0.6517 +0.02196
+0.6762 +0.02155
+0.7008 +0.02314
+0.7254 +0.02037
+0.75 +0.02065
+0.7746 +0.0211
+0.7992 +0.02037
+0.8238 +0.0198
+0.8484 +0.01984
+0.873 +0.01959
+0.8976 +0.01869
+0.9221 +0.01873
+0.9467 +0.01861
+0.9713 +0.01714
+0.9959 +0.01739
+1.02 +0.01678
+1.045 +0.01633
+1.07 +0.01624
+1.094 +0.01543
+1.119 +0.01494
+1.143 +0.01547
+1.168 +0.01445
+1.193 +0.01384
+1.217 +0.01339
+1.242 +0.01384
+1.266 +0.01298
+1.291 +0.0109
+1.316 +0.0122
+1.34 +0.0111
+1.365 +0.0109
+1.389 +0.009592
+1.414 +0.01114
+1.439 +0.0111
+1.463 +0.009061
+1.488 +4.082e-05
I have to compute the following integral using the empirical probability density by some kind of interpolation:
or it can be re-written as
where is defined as
and a is given as
I am wondering how I can compute this complicated integral, given that an empirical probability density function appears inside the integrand.
Do I need to do some kind of interpolation?
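One possible approach, sketched under assumptions: the exact integrand is not recoverable from the text above, so g(z) below is a purely hypothetical stand-in for everything that multiplies p(z). The sketch interpolates the tabulated p(z) linearly with np.interp and integrates the product with scipy.integrate.quad over the tabulated range only. Only the first six rows of the table are embedded to keep it short.

```python
import numpy as np
from scipy.integrate import quad

# First rows of the (z, p(z)) table above; for the full file one could use
# z, p = np.loadtxt("data.cat", unpack=True)
z = np.array([0.01234, 0.03693, 0.06152, 0.08611, 0.1107, 0.1353])
p = np.array([0.002816, 0.003265, 0.003551, 0.006612, 0.008898, 0.01041])


def p_interp(zq):
    """Piecewise-linear interpolation of the tabulated p(z)."""
    return np.interp(zq, z, p)


def g(zq):
    """Hypothetical placeholder for the rest of the integrand (an assumption)."""
    return 1.0 / (1.0 + zq)


# Integrate p(z) * g(z) between the first and last tabulated z.
# The interior grid points are passed as break points because the
# linear interpolant has kinks there.
value, abserr = quad(lambda t: p_interp(t) * g(t), z[0], z[-1], points=z[1:-1])

# Cross-check: trapezoidal rule applied directly on the tabulated grid.
f = p * g(z)
trapezoid_check = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(z))

print(value, trapezoid_check)
```

If the two numbers differ noticeably, the grid is probably too coarse for the chosen g(z), and a smoother interpolant (for example a spline) may be worth trying.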

Related

Determining Fourier Coefficients from Time Series Data

I asked a since deleted question regarding how to determine Fourier coefficients from time series data. I am resubmitting this because I have better formulated the problem and have a solution that I'll give as I think others may find this very useful.
I have some time series data that I have binned into equally spaced time bins (a fact which will be crucial to my solution), and from that data I want to determine the Fourier series (or any function, really) that best describes the data. Here is a MWE with some test data to show the data I'm trying to fit:
import numpy as np
import matplotlib.pyplot as plt
# Create a dependent test variable to define the x-axis of the test data.
test_array = np.linspace(0, 1, 101) - 0.5
# Define some test data to try to apply a Fourier series to.
test_data = [0.9783883464566918, 0.979599093567252, 0.9821424606299206, 0.9857575507812502, 0.9899278899999995,
0.9941848228346452, 0.9978438300395263, 1.0003009205426352, 1.0012208923679058, 1.0017130521235522,
1.0021799664031628, 1.0027475606936413, 1.0034168260869563, 1.0040914266144825, 1.0047781181102355,
1.005520348837209, 1.0061899214145387, 1.006846206627681, 1.0074483048543692, 1.0078691461988312,
1.008318736328125, 1.008446947572815, 1.00862051262136, 1.0085134881422921, 1.008337095516569,
1.0079539881889774, 1.0074857334630352, 1.006747783037474, 1.005962048923679, 1.0049115434782612,
1.003812267822736, 1.0026427549407106, 1.001251963531669, 0.999898555335968, 0.9984976286266923,
0.996995982142858, 0.9955652088974847, 0.9941647321428578, 0.9927727076023389, 0.9914750532544377,
0.990212467710371, 0.9891098035363466, 0.9875998927875242, 0.9828093773946361, 0.9722532524271845,
0.9574084365384614, 0.9411012303149601, 0.9251820309477757, 0.9121488392156851, 0.9033119748549322,
0.9002445803921568, 0.9032760564202343, 0.91192435882353, 0.9249696964980555, 0.94071381372549,
0.957139088974855, 0.9721083392156871, 0.982955287937743, 0.9880613320235758, 0.9897455322896282,
0.9909590626223097, 0.9922601592233015, 0.9936513112840472, 0.9951442427184468, 0.9967071285988475,
0.9982921493123781, 0.9998775465116277, 1.001389230174081, 1.0029109110251453, 1.0044033691406251,
1.0057110841487276, 1.0069551867704276, 1.008118776264591, 1.0089884470588228, 1.0098663972602735,
1.0104514566473979, 1.0109849223300964, 1.0112043902912626, 1.0114717968750002, 1.0113343036750482,
1.0112205972495087, 1.0108811786407768, 1.010500276264591, 1.0099054552529192, 1.009353759223301,
1.008592596116505, 1.007887223091976, 1.0070715634615386, 1.0063525891472884, 1.0055587861271678,
1.0048733732809436, 1.0041832862669238, 1.0035913326848247, 1.0025318871595328, 1.000088536345776,
0.9963596140350871, 0.9918380684931506, 0.9873937281553398, 0.9833394624277463, 0.9803621496062999,
0.9786476100386117]
# Create a figure to view the data.
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
# Plot the data.
ax.scatter(test_array, test_data, color="k", s=1)
This outputs the following:
The question is how to determine the Fourier series best describing this data. The usual formula for determining the Fourier coefficients requires inserting a function into an integral, but if I had a function to describe the data I wouldn't need the Fourier coefficients at all; the whole point of finding this series is to have a functional representation of the data. In the absence of such a function, then, how are the coefficients found?
My solution to this problem is to apply a discrete Fourier transform to the data using NumPy's implementation of the Fast Fourier Transform, numpy.fft.fft(); this is why it's critical that the data is evenly spaced in time, as FFT requires this. While the FFT is typically used to perform analysis of the frequency spectrum, the desired Fourier coefficients are directly related to the output of this function.
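Specifically, this function outputs a series of complex-valued coefficients c_n, one per input sample. With N the number of bins, the Fourier series coefficients a_n and b_n are found using the relations:
$$a_n = \frac{2}{N}\,\operatorname{Re}(c_n), \qquad b_n = -\frac{2}{N}\,\operatorname{Im}(c_n)$$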
Therefore the FFT allows the Fourier coefficients to be directly computed. Here is the MWE of my solution to this problem, expanding the example given above:
import numpy as np
import matplotlib.pyplot as plt
# Set the number of equal-time bins to create.
n_bins = 101
# Set the number of Fourier coefficients to use.
n_coeff = 51
# Define a function to generate a Fourier series based on the coefficients determined by the Fast Fourier Transform.
# x is the array of phases at which the series is evaluated.
def create_fourier_series(x, coefficients):
    # Begin the series with the zeroth-order Fourier coefficient.
    fourier_series = coefficients[0][0] / 2
    # Now generate the remaining terms. The period is defined to be 1 since we're operating in phase
    # space.
    for n in range(1, len(coefficients)):
        fourier_series += (coefficients[n][0] * np.cos(2 * np.pi * n * x) + coefficients[n][1] *
                           np.sin(2 * np.pi * n * x))
    return fourier_series
# Create a dependent test variable to define the x-axis of the test data.
test_array = np.linspace(0, 1, n_bins) - 0.5
# Define some test data to try to apply a Fourier series to.
test_data = [0.9783883464566918, 0.979599093567252, 0.9821424606299206, 0.9857575507812502, 0.9899278899999995,
0.9941848228346452, 0.9978438300395263, 1.0003009205426352, 1.0012208923679058, 1.0017130521235522,
1.0021799664031628, 1.0027475606936413, 1.0034168260869563, 1.0040914266144825, 1.0047781181102355,
1.005520348837209, 1.0061899214145387, 1.006846206627681, 1.0074483048543692, 1.0078691461988312,
1.008318736328125, 1.008446947572815, 1.00862051262136, 1.0085134881422921, 1.008337095516569,
1.0079539881889774, 1.0074857334630352, 1.006747783037474, 1.005962048923679, 1.0049115434782612,
1.003812267822736, 1.0026427549407106, 1.001251963531669, 0.999898555335968, 0.9984976286266923,
0.996995982142858, 0.9955652088974847, 0.9941647321428578, 0.9927727076023389, 0.9914750532544377,
0.990212467710371, 0.9891098035363466, 0.9875998927875242, 0.9828093773946361, 0.9722532524271845,
0.9574084365384614, 0.9411012303149601, 0.9251820309477757, 0.9121488392156851, 0.9033119748549322,
0.9002445803921568, 0.9032760564202343, 0.91192435882353, 0.9249696964980555, 0.94071381372549,
0.957139088974855, 0.9721083392156871, 0.982955287937743, 0.9880613320235758, 0.9897455322896282,
0.9909590626223097, 0.9922601592233015, 0.9936513112840472, 0.9951442427184468, 0.9967071285988475,
0.9982921493123781, 0.9998775465116277, 1.001389230174081, 1.0029109110251453, 1.0044033691406251,
1.0057110841487276, 1.0069551867704276, 1.008118776264591, 1.0089884470588228, 1.0098663972602735,
1.0104514566473979, 1.0109849223300964, 1.0112043902912626, 1.0114717968750002, 1.0113343036750482,
1.0112205972495087, 1.0108811786407768, 1.010500276264591, 1.0099054552529192, 1.009353759223301,
1.008592596116505, 1.007887223091976, 1.0070715634615386, 1.0063525891472884, 1.0055587861271678,
1.0048733732809436, 1.0041832862669238, 1.0035913326848247, 1.0025318871595328, 1.000088536345776,
0.9963596140350871, 0.9918380684931506, 0.9873937281553398, 0.9833394624277463, 0.9803621496062999,
0.9786476100386117]
# Determine the fast Fourier transform for this test data.
fast_fourier_transform = np.fft.fft(test_data[n_bins // 2:] + test_data[:n_bins // 2])
# Create an empty list to hold the values of the Fourier coefficients.
fourier_coeff = []
# Loop through the FFT and pick out the a and b coefficients, which are the real and imaginary parts of the
# coefficients calculated by the FFT.
for n in range(0, n_coeff):
    a = 2 * fast_fourier_transform[n].real / n_bins
    b = -2 * fast_fourier_transform[n].imag / n_bins
    fourier_coeff.append([a, b])
# Create the Fourier series approximating this data.
fourier_series = create_fourier_series(test_array, fourier_coeff)
# Create a figure to view the data.
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
# Plot the data.
ax.scatter(test_array, test_data, color="k", s=1)
# Plot the Fourier series approximation.
ax.plot(test_array, fourier_series, color="b", lw=0.5)
This outputs the following:
Note that how I defined the FFT (importing the second half of the data followed by the first half) is a consequence of how this data was generated. Specifically, the data runs from -0.5 to 0.5, but the FFT assumes it runs from 0.0 to 1.0, necessitating this shift.
I've found that this works quite well for data that doesn't include very sharp and narrow discontinuities. I would be interested to hear if anyone has another suggested solution to this problem, and I hope people find this explanation clear and helpful.
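As a small sanity check of the relation between FFT output and Fourier coefficients, here is a minimal sketch on a synthetic signal whose coefficients are known in advance. The signal, the sample count and the variable names are assumptions made for illustration; np.fft.rfft is used instead of np.fft.fft because the input is real.

```python
import numpy as np

# Synthetic periodic signal on [0, 1) with known coefficients:
# a_0 / 2 = 1, a_3 = 0.5, b_5 = 0.25, everything else zero.
n_samples = 64
x = np.arange(n_samples) / n_samples
y = 1.0 + 0.5 * np.cos(2 * np.pi * 3 * x) + 0.25 * np.sin(2 * np.pi * 5 * x)

# FFT of real input; c[k] is the k-th complex DFT coefficient.
c = np.fft.rfft(y)

# Fourier series coefficients from the real and imaginary parts.
a = 2 * c.real / n_samples
b = -2 * c.imag / n_samples

# Rebuild the series from the first n_terms harmonics (vectorized).
n_terms = 10
n = np.arange(1, n_terms + 1)[:, None]
series = a[0] / 2 + np.sum(
    a[1:n_terms + 1, None] * np.cos(2 * np.pi * n * x)
    + b[1:n_terms + 1, None] * np.sin(2 * np.pi * n * x),
    axis=0,
)

print(a[3], b[5], np.allclose(series, y))
```

The sample points here cover one period without repeating the endpoint, which is the layout the discrete Fourier transform assumes.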
Not sure if it helps you in any way, but I wrote a program to interpolate your data. This is done using buildingblocks==0.0.15.
Please see below:
import matplotlib.pyplot as plt
from buildingblocks import bb
import numpy as np
Ydata = [0.9783883464566918, 0.979599093567252, 0.9821424606299206, 0.9857575507812502, 0.9899278899999995,
0.9941848228346452, 0.9978438300395263, 1.0003009205426352, 1.0012208923679058, 1.0017130521235522,
1.0021799664031628, 1.0027475606936413, 1.0034168260869563, 1.0040914266144825, 1.0047781181102355,
1.005520348837209, 1.0061899214145387, 1.006846206627681, 1.0074483048543692, 1.0078691461988312,
1.008318736328125, 1.008446947572815, 1.00862051262136, 1.0085134881422921, 1.008337095516569,
1.0079539881889774, 1.0074857334630352, 1.006747783037474, 1.005962048923679, 1.0049115434782612,
1.003812267822736, 1.0026427549407106, 1.001251963531669, 0.999898555335968, 0.9984976286266923,
0.996995982142858, 0.9955652088974847, 0.9941647321428578, 0.9927727076023389, 0.9914750532544377,
0.990212467710371, 0.9891098035363466, 0.9875998927875242, 0.9828093773946361, 0.9722532524271845,
0.9574084365384614, 0.9411012303149601, 0.9251820309477757, 0.9121488392156851, 0.9033119748549322,
0.9002445803921568, 0.9032760564202343, 0.91192435882353, 0.9249696964980555, 0.94071381372549,
0.957139088974855, 0.9721083392156871, 0.982955287937743, 0.9880613320235758, 0.9897455322896282,
0.9909590626223097, 0.9922601592233015, 0.9936513112840472, 0.9951442427184468, 0.9967071285988475,
0.9982921493123781, 0.9998775465116277, 1.001389230174081, 1.0029109110251453, 1.0044033691406251,
1.0057110841487276, 1.0069551867704276, 1.008118776264591, 1.0089884470588228, 1.0098663972602735,
1.0104514566473979, 1.0109849223300964, 1.0112043902912626, 1.0114717968750002, 1.0113343036750482,
1.0112205972495087, 1.0108811786407768, 1.010500276264591, 1.0099054552529192, 1.009353759223301,
1.008592596116505, 1.007887223091976, 1.0070715634615386, 1.0063525891472884, 1.0055587861271678,
1.0048733732809436, 1.0041832862669238, 1.0035913326848247, 1.0025318871595328, 1.000088536345776,
0.9963596140350871, 0.9918380684931506, 0.9873937281553398, 0.9833394624277463, 0.9803621496062999,
0.9786476100386117]
Xdata=list(range(0,len(Ydata)))
Xnew=list(np.linspace(0,len(Ydata),200))
Ynew=bb.interpolate(Xdata,Ydata,Xnew,40)
plt.figure()
plt.plot(Xdata,Ydata)
plt.plot(Xnew,Ynew,'*')
plt.legend(['Given Data', 'Interpolated Data'])
plt.show()
Should you want to develop the code further, I have also included code so that you can view a module's source code and learn from it:
import module
import inspect
src = inspect.getsource(module)
print(src)

Weights minimization issue with linprog

I am trying to use python (and at present failing) to come to a more efficient solution than Excel Solver provides for an optimization problem.
Matrices
The problem has the form AB = C --> D,
where AB produces C, and the absolute value of C - D should be minimized for each row of the matrix.
I have seven funds contained in matrix B all of which have geographic exposure of the form
FUND_NAME = np.array([UK, USA, EuroZone, Japan, EM, Apac])
as below
RLS = np.array([0.788743177, 0.168048481,0,0.043208342,0,0])
LIOGLB=np.array([0.084313978,0.578528092,0,0.23641746,0.033709666,0.067030804])
LIONEUR=np.array([0.055032339,0,0,0.944967661,0,0])
STEW_WLDWD=np.array([0.09865472,0.210582713,0.053858632,0.431968002,0.086387178,0.118548755])
EMMK=np.array([0.080150377,0.025212864,0.597285513,0.031832241,0.212440426,0.053078578])
PAC=np.array([0,0.013177633,0.41273195,0,0.510644775,0.063445642])
PICTET=np.array([0.089520913,0.635857603,0,0.218148413,0.023290413,0.033182659])
From this I need to construct an optimal weighting of the seven funds using a matrix (imaginatively named A) [x1,x2,x3,x4,x5,x6,x7] with x1+x2+...+x7=1, and for each i from 1 to 7:
xi lower bound = 0
xi upper bound = 0.25
The aim is for the actual regional weights (matrix C) to be as close as possible to the Target array below (which corresponds to matrix D above):
Target=np.array([0.2310,0.2576,0.1047,0.1832,0.1103,0.1131])
I've tried using linprog, but I know that the answer I am getting is wrong.
Funds =np.array([RLS,LIOGLB, LIONEUR,STEW_WLDWD, EMMK,PAC,PICTET])
twentyfive=np.full((1, 7), 0.25)
bounds=[0,0.25]
res = linprog(Target,A_ub=Funds,b_ub=twentyfive,bounds=[bounds])
Can anyone help me move on from Excel?
This is really a LAD regression problem (LAD=Least Absolute Deviation) with some side constraints. Different LP formulations for the LAD regression problems can be found here. Based on the sparse bounding problem, we can formulate the LP model:
This is the mathematical model I am going to try to solve with linprog. The coloring is as follows: blue symbols represent data, red symbols are the decision variables. x are the allocations (fractions) we need to find, d are the residuals of the linear fit and r are the absolute values of d.
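The model itself is not reproduced in this text. Reading the constraints back from the Python code further down (so this is a reconstruction under that assumption, with i indexing the six regions and j the seven funds), it appears to be:

```latex
\begin{aligned}
\min_{x,\,d,\,r}\quad & \sum_{i} r_i \\
\text{s.t.}\quad & \sum_{j} B_{ij}\,x_j + d_i = \text{target}_i && \forall i \\
& \sum_{j} x_j = 1 \\
& -r_i \le d_i \le r_i && \forall i \\
& 0 \le x_j \le 0.25 && \forall j \\
& d_i \ \text{free},\quad r_i \ge 0 && \forall i
\end{aligned}
```

At the optimum each r_i equals |d_i|, so the objective is the sum of absolute deviations from the target.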
linprog requires an explicit LP matrix. For the model above, this A matrix can look like:
With this it is no longer very difficult to develop a Python implementation. The Python code can look like:
import numpy as np
import scipy.optimize as sp
B = np.array([[0.788743177, 0.168048481,0,0.043208342,0,0],
[0.084313978,0.578528092,0,0.23641746,0.033709666,0.067030804],
[0.055032339,0,0,0.944967661,0,0],
[0.09865472,0.210582713,0.053858632,0.431968002,0.086387178,0.118548755],
[0.080150377,0.025212864,0.597285513,0.031832241,0.212440426,0.053078578],
[0,0.013177633,0.41273195,0,0.510644775,0.063445642],
[0.089520913,0.635857603,0,0.218148413,0.023290413,0.033182659]]).T
target = np.array([0.2310,0.2576,0.1047,0.1832,0.1103,0.1131])
m,n = np.shape(B)
A_eq = np.block([[B, np.eye(m), np.zeros((m,m))],
[np.ones(n), np.zeros(m), np.zeros(m)]])
A_ub = np.block([[np.zeros((m,n)),-np.eye(m), -np.eye(m)],
[np.zeros((m,n)),np.eye(m), -np.eye(m)]])
b_eq = np.block([target,1])
b_ub = np.zeros(2*m)
c = np.block([np.zeros(n),np.zeros(m),np.ones(m)])
bnd = n*[(0,0.25)] + m*[(None,None)] + m*[(0,None)]
res = sp.linprog(c,A_ub,b_ub,A_eq,b_eq,bnd,options={'disp':True})
allocation = res.x[0:n]
The results look like:
Primal Feasibility Dual Feasibility Duality Gap Step Path Parameter Objective
1.0 1.0 1.0 - 1.0 6.0
0.3777262386888 0.3777262386888 0.3777262386888 0.6478228594143 0.3777262386888 0.3200496644143
0.08438152300367 0.08438152300366 0.08438152300367 0.8087424108466 0.08438152300366 0.1335722585582
0.01563291142478 0.01563291142478 0.01563291142478 0.8341722620104 0.01563291142478 0.1118298108651
0.004083901923022 0.004083901923022 0.004083901923023 0.7432737130498 0.004083901923024 0.1049630948572
0.0006190254179117 0.0006190254179117 0.0006190254179116 0.8815177164943 0.000619025417913 0.1016021916581
3.504935403199e-05 3.504935403066e-05 3.504935403079e-05 0.9676694788778 3.504935402756e-05 0.1012177893279
5.983549975387e-07 5.98354980932e-07 5.983549810074e-07 0.9885372873161 5.983549719474e-07 0.1011921413019
3.056236812029e-11 3.056401712736e-11 3.056394819773e-11 0.9999489201822 3.056087926755e-11 0.1011915586046
Optimization terminated successfully.
Current function value: 0.101192
Iterations: 8
print(allocation)
[2.31621461e-01 2.50000000e-01 9.07425872e-12 2.50000000e-01
4.45030949e-10 2.39692743e-01 2.86857955e-02]

How to calculate a p-value for a point on a (normal) distribution?

I'm trying to calculate a p-value for my metric (Spearman), and I'm trying to generalize the method so it can work with other metrics (instead of relying on scipy.stats.spearmanr).
How can I generate a p-value of a point from this distribution?
Does the method apply to non-normal distributions? This one is roughly normally distributed and would probably be more so if I sampled more than 100 points.
This post requires µ=0, std=1: Convert Z-score (Z-value, standard score) to p-value for normal distribution in Python
from scipy import stats
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
data = np.asarray([0.027972027972027972, -0.2802197802197802, -0.21818181818181817, 0.3464285714285714, 0.15, 0.34065934065934067, -0.3216783216783217, 0.08391608391608392, -0.03496503496503497, -0.2967032967032967, 0.09090909090909091, 0.11188811188811189, 0.1181818181818182, -0.4787878787878788, -0.6923076923076923, -0.05494505494505495, 0.19090909090909092, 0.3146853146853147, -0.42727272727272725, 0.06363636363636363, 0.1978021978021978, 0.12142857142857141, 0.10303030303030303, 0.23214285714285712, -0.5804195804195805, 0.013986013986013986, 0.02727272727272727, 0.5659340659340659, 0.06363636363636363, -0.503030303030303, -0.2867132867132867, 0.07252747252747253, -0.13736263736263737, 0.21212121212121213, -0.09010989010989011, -0.2517482517482518, -0.17482517482517484, -0.3706293706293707, 0.15454545454545454, 0.01818181818181818, 0.17582417582417584, 0.3230769230769231, -0.09642857142857142, -0.5274725274725275, -0.23626373626373626, -0.2692307692307692, -0.2857142857142857, -0.19999999999999998, -0.489010989010989, -0.15454545454545454, 0.38461538461538464, 0.6, 0.37762237762237766, -0.0029411764705882353, -0.06993006993006994, -0.19999999999999998, 0.38181818181818183, 0.05454545454545455, -0.03296703296703297, 0.17272727272727273, -0.13986013986013987, -0.08241758241758242, -0.34545454545454546, 0.5252747252747253, 0.10303030303030303, 0.16783216783216784, -0.36363636363636365, -0.42857142857142855, 0.12727272727272726, -0.18181818181818182, -0.10439560439560439, -0.6083916083916084, -0.1956043956043956, 0.13846153846153847, -0.48951048951048953, -0.18881118881118883, 0.7362637362637363, -0.19090909090909092, 0.4909090909090909, 0.37142857142857144, -0.3090909090909091, -0.1098901098901099, 0.15151515151515152, -0.13636363636363635, -0.5494505494505495, 0.44755244755244755, 0.04895104895104896, -0.37142857142857144, 0.01098901098901099, 0.08131868131868132, 0.2571428571428571, -0.3076923076923077, 0.24545454545454545, 0.06043956043956044, 
0.06764705882352941, 0.02727272727272727, -0.07252747252747253, 0.21818181818181817, -0.03846153846153846, 0.48571428571428577])
query_value = -0.44155844155844154
with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    sns.distplot(data, rug=True, color="teal", ax=ax)
    ax.set_xlabel("$x$", fontsize=15)
    ax.axvline(query_value, color="black", linestyle=":", linewidth=1.618, label="Query: %0.5f"%query_value)
    ax.legend()
# Normal Test
print(stats.normaltest(data))
# Fit the data
params = stats.norm.fit(data)
# Generate the distribution
distribution = stats.norm(*params)
distribution
Based on your comments I am going to assume that these are the results from a permutation test. That is, you obtained a value from your original data set (-0.44), while all the other values were obtained by permuting your data. Now you would like to determine whether your original value is significant.
A permutation test (a kind of resampling method) is a non-parametric procedure, so it does not depend on normal distributions. In your case the distribution looks roughly normal, but that is not required. There are different ways you could estimate a p-value from the permuted distribution; the simplest option is similar to your idea.
If you performed every possible permutation you would get an exact distribution, so your formula for a (two-sided) p-value is correct: #(|t| >= |t*|) / p, where t* is the original value, t are the permuted values, #(...) counts the permuted values that satisfy the condition, and p is the total number of permutations.
If you performed only a random subset of the possible permutations, the formula is slightly different, (1 + #(|t| >= |t*|)) / (1 + p), to account for the randomness.
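The two formulas can be sketched in a few lines of NumPy. The function name and the tiny example array are made up for illustration; in practice the permuted statistics would be the 100 values from the question.

```python
import numpy as np


def permutation_p_value(observed, permuted, exact=False):
    """Two-sided permutation p-value.

    observed : statistic computed on the original data (t*)
    permuted : statistics computed on the permuted data (t)
    exact    : True if `permuted` covers every possible permutation
    """
    permuted = np.asarray(permuted, dtype=float)
    # Number of permuted statistics at least as extreme as the observed one.
    n_extreme = np.count_nonzero(np.abs(permuted) >= abs(observed))
    if exact:
        return n_extreme / permuted.size
    # Random subset of permutations: add one to numerator and denominator.
    return (1 + n_extreme) / (1 + permuted.size)


# Hypothetical toy values, small enough to check by hand:
# two of the five permuted values satisfy |t| >= 0.45.
permuted_stats = np.array([0.10, -0.20, 0.50, -0.60, 0.30])
p_subset = permutation_p_value(-0.45, permuted_stats)             # (1 + 2) / (1 + 5)
p_exact = permutation_p_value(-0.45, permuted_stats, exact=True)  # 2 / 5
print(p_subset, p_exact)
```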

Understanding the output of scipy.stats.multivariate_normal

I am trying to build a multidimensional Gaussian model using scipy.stats.multivariate_normal. I am trying to use the output of scipy.stats.multivariate_normal.pdf() to figure out if a test value fits reasonably well in the observed distribution.
From what I understand, high values indicate a better fit to the given model and low values otherwise.
However, in my dataset, I see extremely large PDF(x) results, which lead me to question if I understand things correctly. The area under the PDF curve must be 1, so very large values are hard to comprehend.
For example, consider:
x = [-0.0007569417915494715, -0.01394295997613827, 0.000982078369890444, -0.03633664354397629, -0.03730583036106844, 0.013920453054506978, -0.08115836865224338, -0.07208494497398354, -0.06255237023298793, -0.0531888840386906, -0.006823760545565131]
mean = [0.01663645201261102, 0.07800335614699873, 0.016291452384234965, 0.012042931155488702, 0.0042637244100103885, 0.016531331606477996, -0.021702714746699842, -0.05738646649459681, 0.00921296058625439, 0.027940994009345254, 0.07548111758006244]
covariance = [[0.07921927017771506, 0.04780185747873293, 0.0788086850274493, 0.054129466248481264, 0.018799028456661045, 0.07523731808137141, 0.027682748950487425, -0.007296954729572955, 0.07935165417756569, 0.0569381100965656, 0.04185848489472492], [0.04780185747873293, 0.052300105044833595, 0.047749467098423544, 0.03254872837949123, 0.010582358713999951, 0.045792252383799206, 0.01969282984717051, -0.006089301208961258, 0.05067712814145293, 0.03146214776997301, 0.04452949330387575], [0.0788086850274493, 0.047749467098423544, 0.07841809405745602, 0.05374461924031552, 0.01871005609017673, 0.07487015790787396, 0.02756781074862818, -0.007327131572569985, 0.07895548129950304, 0.056417456686115544, 0.04181063355048408], [0.054129466248481264, 0.03254872837949123, 0.05374461924031552, 0.04538801863296238, 0.015795381235224913, 0.05055944754764062, 0.02017033995851422, -0.006505939129684573, 0.05497361331950649, 0.043858860182247515, 0.029356699144606032], [0.018799028456661045, 0.010582358713999951, 0.01871005609017673, 0.015795381235224913, 0.016260640022897347, 0.015459548918222347, 0.0064542528152879705, -0.0016656858963383602, 0.018761682220822192, 0.015361512546799405, 0.009832025009280924], [0.07523731808137141, 0.045792252383799206, 0.07487015790787396, 0.05055944754764062, 0.015459548918222347, 0.07207012779105286, 0.026330967917717253, -0.006907504360835279, 0.0753380831201204, 0.05335128471397023, 0.03998397595850863], [0.027682748950487425, 0.01969282984717051, 0.02756781074862818, 0.02017033995851422, 0.0064542528152879705, 0.026330967917717253, 0.020837940236441078, -0.003320408544812026, 0.027859582829638897, 0.01967636950969646, 0.017105000942890598], [-0.007296954729572955, -0.006089301208961258, -0.007327131572569985, -0.006505939129684573, -0.0016656858963383602, -0.006907504360835279, -0.003320408544812026, 0.024529061074105817, -0.007869287828047853, -0.006228903058681195, -0.0058974553248417995], [0.07935165417756569, 0.05067712814145293, 
0.07895548129950304, 0.05497361331950649, 0.018761682220822192, 0.0753380831201204, 0.027859582829638897, -0.007869287828047853, 0.08169291677188911, 0.05731196406065222, 0.04450058445993234], [0.0569381100965656, 0.03146214776997301, 0.056417456686115544, 0.043858860182247515, 0.015361512546799405, 0.05335128471397023, 0.01967636950969646, -0.006228903058681195, 0.05731196406065222, 0.05064023101024737, 0.02830810316675855], [0.04185848489472492, 0.04452949330387575, 0.04181063355048408, 0.029356699144606032, 0.009832025009280924, 0.03998397595850863, 0.017105000942890598, -0.0058974553248417995, 0.04450058445993234, 0.02830810316675855, 0.040658283674780395]]
For this, if I compute y = multivariate_normal.pdf(x, mean, covariance),
the result is 342562705.3859754.
How could this be the case? Am I missing something?
Thanks.
This is fine. The probability density function can be larger than 1 at a specific point. It's the integral that must be equal to 1.
The idea that the value must be below 1 is correct for discrete variables, where it is a probability. For continuous variables, however, the pdf is not a probability; it is a density that is integrated to give a probability. That is, the integral from minus infinity to infinity, over all dimensions, is equal to 1.
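A quick way to convince yourself is a deliberately narrow Gaussian. The scale values below are arbitrary choices for illustration: the density at the mean is far above 1, yet the area under the curve is still 1. The same happens in several dimensions when the covariance matrix has a small determinant.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# One dimension: standard deviation 0.01 gives a tall, thin peak.
narrow = stats.norm(loc=0.0, scale=0.01)
peak_density = narrow.pdf(0.0)  # 1 / (0.01 * sqrt(2 * pi)), roughly 39.9

# Integrate over +/- 10 standard deviations, which holds essentially all the mass.
total_area, _ = quad(narrow.pdf, -0.1, 0.1)

# Two dimensions: a small covariance gives an even larger density at the mean.
mvn = stats.multivariate_normal(mean=[0.0, 0.0], cov=np.diag([1e-4, 1e-4]))
peak_density_2d = mvn.pdf([0.0, 0.0])  # 1 / (2 * pi * 1e-4)

print(peak_density, total_area, peak_density_2d)
```

For comparing how well different points fit the same model, logpdf is usually more convenient than pdf because it avoids very large or very small numbers.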

How to do Scipy curve fitting with error bars and obtain standard errors on fitting parameters?

I am trying to fit my data points. The fit without errors does not look very good, so I am now trying to fit the data while taking the error on each point into account. My fit function is below:
def fit_func(x,a,b,c):
    return np.log10(a*x**b + c)
then my data points are below:
r = [ 0.00528039,0.00721161,0.00873037,0.01108928,0.01413011,0.01790143,0.02263833, 0.02886089,0.03663713,0.04659512,0.05921978,0.07540126,0.09593949, 0.12190075,0.15501736,0.19713563,0.25041524,0.3185025,0.40514023,0.51507869, 0.65489938,0.83278859,1.05865016,1.34624082]
logf = [-1.1020581079659384, -1.3966927245616112, -1.4571368537041418, -1.5032694247562564, -1.8534775558300272, -2.2715812166948304, -2.2627690390113862, -2.5275290780299331, -3.3798813619309365, -6.0, -2.6270989211307034, -2.6549656159564918, -2.9366845162570079, -3.0955026428779604, -3.2649261507250289, -3.2837123017838366, -3.0493752067042856, -3.3133647996463229, -3.0865051494299243, -3.1347499415910169, -3.1433062918466632, -3.1747394718538979, -3.1797597345585245, -3.1913094832146616]
Because my data are on a log scale (logf), the error bars for each data point are not symmetric. The upper and lower error bars are below:
upper = [0.070648916083227764, 0.44346256268274886, 0.11928131794776076, 0.094260899008089094, 0.14357124858039971, 0.27236750587684311, 0.18877122991380402, 0.28707938182603066, 0.72011863806906318, 0, 0.16813325716948757, 0.13624929595316049, 0.21847915642008875, 0.25456116079315372, 0.31078368240910148, 0.23178227464741452, 0.09158189214515966, 0.14020538489677881, 0.059482730164901909, 0.051786777740678414, 0.041126467609954531, 0.034394612910981337, 0.027206248503368613, 0.021847333685597548]
lower = [0.06074797748043137, 0.21479225959441428, 0.093479845697059583, 0.077406149968278104, 0.1077175009766278, 0.16610073183912188, 0.13114254113054535, 0.17133966123838595, 0.57498950902908286, 2.9786837094190934, 0.12090437578535695, 0.10355760401838676, 0.14467588244034646, 0.15942693835964539, 0.17929440903034921, 0.15031667827534712, 0.075592499975030591, 0.10581886912443572, 0.05230849287772843, 0.04626422871423852, 0.03756658820680725, 0.03186944137872727, 0.025601929615431285, 0.02080073540367966]
I have the fitting as:
popt, pcov = optimize.curve_fit(fit_func, r, logf,sigma=[lower,upper])
logf_fit = fit_func(r,*popt)
But this is wrong. How can I use the curve fitting from scipy so that it includes the upper and lower errors? And how can I get the errors on the fitted parameters a, b, c?
You can use scipy.optimize.leastsq with custom weights:
import scipy.optimize as optimize
import numpy as np
# redefine lists as array
x=np.array(r)
y=np.array(logf)
errup=np.array(upper)
errlow=np.array(lower)
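# model function
def fit_func(x,a,b,c):
    return np.log10(a*x**b + c)
# error function passed to leastsq
def my_error(V):
    a,b,c=V
    yfit=fit_func(x,a,b,c)
    weight=np.ones_like(yfit)
    weight[yfit>y]=errup[yfit>y] # if the fit point is above the measure, use upper weight
    weight[yfit<=y]=errlow[yfit<=y] # else use lower weight
    # note: leastsq squares the returned values itself, so this minimizes the sum of the fourth powers
    # of the weighted residuals; return (yfit-y)/weight instead for ordinary weighted least squares
    return (yfit-y)**2/weight**2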
answer=optimize.leastsq(my_error,x0=[0.0001,-1,0.0006])
a,b,c=answer[0]
print(a,b,c)
It works, but is very sensitive to the initial values, since the argument of the log can become negative during the search, at which point the fit fails. Here I find a=9.14464745425e-06 b=-1.75179880756 c=0.00066720486385, which is pretty close to the data.
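The question also asks for standard errors on a, b and c. One common approximation, which is not the method used in this answer, is to symmetrize the error bars and let scipy.optimize.curve_fit return the covariance matrix. The sketch below makes several assumptions: the data are synthetic stand-ins generated from parameters close to those found above, the error bars are constant, the symmetric sigma is simply the mean of the upper and lower errors, and the bounds are arbitrary limits chosen to keep the argument of the log positive.

```python
import numpy as np
from scipy.optimize import curve_fit


def fit_func(x, a, b, c):
    return np.log10(a * x**b + c)


# Synthetic stand-in data (an assumption, not the data from the question).
rng = np.random.default_rng(0)
x = np.geomspace(0.005, 1.3, 24)
true_params = (1e-5, -1.75, 6.7e-4)
err_up = np.full_like(x, 0.06)
err_low = np.full_like(x, 0.04)
y = fit_func(x, *true_params) + rng.normal(0.0, 0.05, size=x.size)

# Assumption: approximate the asymmetric errors by their mean.
sigma = 0.5 * (err_up + err_low)

popt, pcov = curve_fit(
    fit_func, x, y,
    p0=(2e-5, -1.5, 1e-3),
    sigma=sigma,
    absolute_sigma=True,  # treat sigma as real measurement errors
    bounds=([1e-12, -5.0, 1e-12], [1.0, 0.0, 1.0]),  # keep a and c positive
    x_scale="jac",  # forwarded to least_squares; helps with very different scales
)

# One-sigma uncertainties on a, b, c from the diagonal of the covariance.
perr = np.sqrt(np.diag(pcov))
print(popt, perr)
```

With real data, a point whose upper or lower error is zero would need special handling before symmetrizing, since curve_fit divides the residuals by sigma.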
