I am new to the Gaussian distribution, and something looks a little strange here.
Any idea what kind of occurrence could cause an odd bin like this to appear between -1 and -0.5?
Here is my code (though I am not sure whether it helps or not):
import numpy as np
import matplotlib.pyplot as plt

mu_0 = 0.178950369
srd_0 = 0.455387161

## aa is a list of float values
aa_list = np.array(aa)
data = aa_list * srd_0 + mu_0
data = data.reshape(-1, 1)

plt.figure(num=None, figsize=(12, 8), dpi=400, facecolor='w', edgecolor='k')
hx, hy, _ = plt.hist(data, bins=50, density=True, color="blue")
plt.ylim(0.0, 1.5)
plt.title('Gaussian mixture example 01')
plt.grid()
plt.xlim(mu_0 - 4*srd_0, mu_0 + 4*srd_0)
Without seeing the values of aa, I'd guess that your data is being clipped on the low end: instead of being allowed to go as negative as it wants, it's being restricted to some floor value.
All the samples that would have fallen below that clipping threshold then pile up at that one negative value.
This doesn't appear to have much to do with the Gaussian Mixture Model of your title, though. A Gaussian Mixture Model is a stochastic (random) model that assumes the data is generated by several distinct Gaussian distributions added together. Mixture here means "added".
How to solve it?
This has to start with how the aa data is being acquired. How was it generated? Is the equipment used to capture it set up correctly? Are there some NaN values in the aa data that are converting to a nonsense value? If so, replace the NaN with the average of the non-NaN values on either side of the NaN value.
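For example, a minimal sketch of that NaN replacement with pandas (assuming aa is a flat list of floats): linear interpolation fills each isolated NaN with exactly the average of its two non-NaN neighbours.

import numpy as np
import pandas as pd

aa = [0.1, 0.3, np.nan, 0.5, 0.2]               # made-up example values
aa_filled = pd.Series(aa).interpolate().tolist()
# the isolated NaN becomes the mean of its neighbours: [0.1, 0.3, 0.4, 0.5, 0.2]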
It's impossible to know without the rest of the code, but it seems the problem may be in cutting off the data. If your code implicitly (or explicitly) has something along the lines of:
if x < -0.75:
x = -0.75
Then that would explain the cutoff: all the values less than -0.75 get clamped to -0.75, hence the spike.
It's also possible that you simply aren't generating enough data, and that the spike at -0.75 is just a random fluctuation. A common rule of thumb (loosely motivated by the central limit theorem) is that you need on the order of 20 repeats of a thing, and the more the better, before a feature is statistically meaningful, so you may be able to get rid of that spike just by throwing more data at it. If not, it's almost certainly the cutoff.
If it's the cutoff problem, changing the rest of the code to do something equivalent to:

if x < -0.75:
    continue   # skip the value entirely instead of clamping it

should fix it, at least on the range you're interested in.
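Concretely, a minimal sketch of that filtering (assuming the clipping really does happen at -0.75, which is just the example threshold used above) would drop the affected values before the histogram is built:

import numpy as np
import matplotlib.pyplot as plt

# drop the clipped values instead of letting them pile up in a single bin
aa_clean = [x for x in aa if x >= -0.75]
data = np.array(aa_clean) * srd_0 + mu_0
plt.hist(data, bins=50, density=True, color="blue")
plt.show()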
I wanted to know if there's a way to exclude one or more data regions in a polynomial fit. Currently this doesn't seem to work as I would expect. Here a small example:
import numpy as np
import pandas as pd
import zfit
# Create test data
left_data = np.random.uniform(0, 3, size=1000).tolist()
mid_data = np.random.uniform(3, 6, size=5000).tolist()
right_data = np.random.uniform(6, 9, size=1000).tolist()
testsample = pd.DataFrame(left_data + mid_data + right_data, columns=["x"])
# Define fit parameter
coeff1 = zfit.Parameter('coeff1', 0.1, -3, 3)
coeff2 = zfit.Parameter('coeff2', 0.1, -3, 3)
# Define Space for the fit
obs_all = zfit.Space("x", limits=(0, 9))
# Perform the fit
bkg_fit = zfit.pdf.Chebyshev(obs=obs_all, coeffs=[coeff1, coeff2], coeff0=1)
new_testsample = zfit.Data.from_pandas(obs=obs_all, df=testsample.query("x<3 or x>6"), weights=None)
nll = zfit.loss.UnbinnedNLL(model=bkg_fit, data=new_testsample)
minimizer = zfit.minimize.Minuit()
result = minimizer.minimize(nll)
[figure TestSample.png: histogram of the generated test sample]
Here I've created a small test sample from three uniform distributions. I only want to use the data in x < 3 OR x > 6 and ignore the 'peak' in between. Because of their equal shape and height, I'd expect coeff1 and coeff2 to come out at (nearly) zero and the fitted curve to be a straight, horizontal line. Obviously this doesn't happen, because zfit assumes that there are simply no entries between 3 and 6.
I also tried using MultiSpaces to ignore that region via
limit1 = zfit.Space("x", limits=(0, 3))
limit2 = zfit.Space("x", limits=(6, 9))
obs_data = limit1 + limit2
But this leads to a
ValueError: obs need to be a Space with exactly one limit if rescaling is requested.
Anyone has an idea how to solve this?
Thanks in advance ^^
Indeed, this is a bit of a tricky problem, but it may just need a small update in zfit.
What you are doing is correct: simply use only the data in the desired region. However, this is not the whole story, because there is a "normalization range": probabilistically speaking, it's like conditioning on a certain region, since we know the data can only lie in that region. Hence the normalization of the PDF should only integrate over the included (low and high) regions.
This can normally be done in two ways:
Using multispace
using the multispace property as you do. This should work (though it is most probably not the way to go in the future), except for a quirk in the polynomial PDFs: the polynomials are defined on -1 to 1. Currently the data is therefore simply rescaled to lie within -1 and 1, and for that the PDF uses its "space" property, which at the moment has to be a simple space (in principle a multispace could also be allowed here, using the minimum and maximum of its limits).
Simultaneous fit
As mentioned in the comments by #jtlz2, you can do a simultaneous fit. That is nothing to worry about, it is simply splitting the likelihood into two parts. As it is a product of probabilities, we can just conceptually split it into two products and multiply (or add their log).
So you can have the pdf fit the lower region and the upper at the same time. However, this does not solve the problem of the normalization: what should the PDF be normalized to? We will run into the same problem.
Solution 1: different space and norm
The space and the normalization range are, however, not the same thing. By default the space (usually called 'obs') is also used as the default normalization range, but it doesn't have to be. So you could use one space going from the lowest to the largest point as the obs, and then set the norm range to your multispace (set_norm should do it, or set_norm_range if you're not on the newest version). This, I think, should do the trick.
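A rough sketch of what that could look like, reusing the names from the question (whether the method is called set_norm or set_norm_range depends on your zfit version, as noted, so treat this as untested):

obs_all = zfit.Space("x", limits=(0, 9))                      # single simple space used as obs
norm_region = zfit.Space("x", limits=(0, 3)) + zfit.Space("x", limits=(6, 9))
bkg_fit = zfit.pdf.Chebyshev(obs=obs_all, coeffs=[coeff1, coeff2], coeff0=1)
bkg_fit.set_norm_range(norm_region)                           # or bkg_fit.set_norm(norm_region)
# then build the UnbinnedNLL with the data restricted to x < 3 or x > 6, as before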
Solution 2: manual re-scaling
The actual problem is that it complains about the rescaling to -1 and 1, which can't be done with a multispace. Every polynomial PDF that does this rescaling can also be told not to, via the apply_scaling=False argument. With that, you're responsible for scaling the data to within -1 and 1 yourself (the polynomials are not defined outside that range), and there should not be any error.
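One possible reading of that, again reusing the names from the question (the apply_scaling argument comes from the description above and the exact behaviour may depend on your zfit version, so this is a sketch rather than tested code):

def to_unit(v):
    # linear map from the original range [0, 9] onto [-1, 1]
    return 2 * (v - 0.0) / (9.0 - 0.0) - 1

limit1 = zfit.Space("x", limits=(to_unit(0), to_unit(3)))
limit2 = zfit.Space("x", limits=(to_unit(6), to_unit(9)))
obs_scaled = limit1 + limit2

scaled = testsample.query("x < 3 or x > 6").copy()
scaled["x"] = to_unit(scaled["x"])            # rescale the data by hand

bkg_fit = zfit.pdf.Chebyshev(obs=obs_scaled, coeffs=[coeff1, coeff2], coeff0=1, apply_scaling=False)
new_testsample = zfit.Data.from_pandas(obs=obs_scaled, df=scaled, weights=None)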
As the title mentions, I am having trouble fitting data points to a function with 3 domains whose boundaries are a parameter of my function. Here is the function I am dealing with:
global sigma_m
sigma_m=2*10**(-12)
global sigma_f
sigma_f=10**3
def Conductivity(phi, phi_c, t, s):
    # three regimes: below, exactly at, and above the critical value phi_c
    sigma = [0]*len(phi)
    for i in range(0, len(phi)):
        if phi[i] < phi_c:
            sigma[i] = sigma_m*(phi_c - phi[i])**(-s)
        elif phi[i] == phi_c:
            sigma[i] = sigma_f*(sigma_m/sigma_f)**(t/(t + s))
        else:
            sigma[i] = sigma_f*(phi[i] - phi_c)**t
    return sigma
And my data points are:
phi_data=[0,0.005,0.007,0.008,0.017,0.05,0.085,0.10]
sigma_data=[2.00E-12,2.50E-12,3.00E-12,9.00E-04,1.00E-01,1.00E+00,2.00E+00,3.00E+00]
My constraints are that phi_c, s, and t must be strictly greater than zero (in practice, phi_c is rarely higher than 0.1 but higher than 0.001, s is usually between 0.5 and 1.5, and t is usually anywhere between 1.5 and 6).
My goal is to fit my data points and have my fit give me values of phi_c, s, and t. s and t can be estimated to help the code (in the specific set of data points that I showed, t should be around 2, and s should be around 0.5). phi_c is completely unknown, except for the range of values that I mentioned just above.
I have used both curve_fit from scipy and Model from lmfit but both provide ridiculously small phi_c values (like 10**(-16) or similarly small values that make me believe the programme wants phi_c to be negative).
Here is my code for when I used curve_fit:
from scipy.optimize import curve_fit

popt, pcov = curve_fit(Conductivity, phi_data, sigma_data, p0=[0.01, 2, 0.5], bounds=(0, [0.5, 10, 3]))
Here is my code for when I used Model from lmfit:
from lmfit import Model

t_estimate = 0.5
s_estimate = 2
phi_c_estimate = 0.005
condmodel = Model(Conductivity)
params = condmodel.make_params(phi_c=phi_c_estimate, t=t_estimate, s=s_estimate)
# bounds have to be set before the fit is run for them to have any effect
params['phi_c'].min = 0
params['phi_c'].max = 0.1
result = condmodel.fit(sigma_data, params, phi=phi_data)
Both options give an okay fit when plotted, but the estimated value of phi_c is nowhere near plausible.
If you have any idea what I could do to have a better fit, please let me know!
PS: I read a promising post about using the symfit package to fit the data in the different regions separately; unfortunately symfit does not work for me. It keeps uninstalling my version of scipy, reinstalling an older one, and then telling me it needs a newer version of scipy to function.
EDIT: I managed to make the symfit package work. Here is my entire code:
from symfit import parameters, variables, Fit, Piecewise, exp, Eq
import numpy as np
import matplotlib.pyplot as plt
global sigma_m
sigma_m=2*10**(-12)
global sigma_f
sigma_f=10**3
phi, sigma = variables ('phi, sigma')
t, s, phi_c = parameters('t, s, phi_c')
phi_c.min = 0.001
phi_c.max = 0.1
sigma1 = sigma_m*(phi_c-phi)**(-s)
sigma2 = sigma_f*(phi-phi_c)**t
model = {sigma: Piecewise ((sigma1, phi <= phi_c), (sigma2, phi > phi_c))}
constraints = [Eq(sigma1.subs({phi: phi_c}), sigma2.subs({phi: phi_c}))]
phi_data=np.array([0,0.005,0.007,0.008,0.017,0.05,0.085,0.10])
sigma_data=np.array([2.00E-12,2.50E-12,3.00E-12,9.00E-04,1.00E-01,1.00E+00,2.00E+00,3.00E+00])
fit = Fit(model, phi=phi_data, sigma=sigma_data, constraints=constraints)
fit_result = fit.execute()
print(fit_result)
Unfortunately I get the following error:
File "D:\Programs\Anaconda\lib\site-packages\sympy\printing\pycode.py", line 236, in _print_ComplexInfinity
return self._print_NaN(expr)
File "D:\Programs\Anaconda\lib\site-packages\sympy\printing\pycode.py", line 74, in _print_known_const
known = self.known_constants[expr.__class__.__name__]
KeyError: 'ComplexInfinity'
My knowledge of coding is very limited, I have no idea what this means and what I should do to not have this error anymore. Please let me know if you have an idea.
I'm not certain that I have a single answer for you, but this will be too long to fit into a comment.
First, a model that switches functional form is especially challenging. What's more, your form has
elif phi[i]==phi_c:
For floating point variables, this will essentially never be true. You probably don't mean "exactly equal" but "pretty close", which might be
elif abs(phi[i] - phi_c) < 1.0e-5:
or something...
But also, converting that from a for loop to using numpy.where() is probably worth looking into.
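A sketch of what that vectorized version might look like, using boolean masks (the same indexing idea behind numpy.where) and a tolerance instead of exact equality; sigma_m and sigma_f are the constants from your code, and the tolerance value here is arbitrary:

import numpy as np

def conductivity_vec(phi, phi_c, t, s, tol=1e-9):
    # vectorized version of the piecewise model, without a Python loop
    phi = np.asarray(phi, dtype=float)
    sigma = np.empty_like(phi)
    below = phi < phi_c - tol
    near = np.abs(phi - phi_c) <= tol
    above = ~(below | near)
    sigma[below] = sigma_m * (phi_c - phi[below]) ** (-s)
    sigma[near] = sigma_f * (sigma_m / sigma_f) ** (t / (t + s))
    sigma[above] = sigma_f * (phi[above] - phi_c) ** t
    return sigma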
Second, it is not at all clear that your different forms actually evaluate to the same values at the boundaries to ensure a continuous function. You might want to check that.
Third, models with powers and exponentials are especially challenging to fit, as a small change in a power can have a huge impact on the resulting value. It's also very easy to get "negative value raised to a non-integer power", which is, of course, complex.
Fourth, those sigma_m and sigma_f constants look like they could easily cause trouble. You should definitely evaluate your model at your starting parameter values and check whether it can roughly reproduce your data; I suspect that you'll need to change your starting values.
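For example, a quick sanity check along those lines, using starting values in the same ballpark as those in the question (the log scale is only because the data spans roughly twelve decades):

import matplotlib.pyplot as plt

phi_c0, t0, s0 = 0.005, 2.0, 0.5            # illustrative starting guesses
sigma_start = Conductivity(phi_data, phi_c0, t0, s0)

plt.semilogy(phi_data, sigma_data, 'o', label='data')
plt.semilogy(phi_data, sigma_start, 's-', label='model at starting values')
plt.legend()
plt.show()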
Hey, so I'm trying to plot variables like age against frequency for a rotating body. I am given the period and period derivative as well as their associated errors. Since frequency is related to period by:
f = 1/T
where frequency is f and period is T
then,
df = - (1/(T^2)) * dT
where dT and df are the derivatives of period and frequency,
but when it comes to plotting the log of this I can't do it in python as it doesn't accept negative values for a loglog plot.
I've tried a work around of using only absolute values but then I only get half the errors when plotting error bars. Is there a way to make python plot both the negative and positive error bars? The frequency derivative itself is a negative quantity.
Unfortunately, x cannot be negative in log(x), because log(x) = y <=> 10^y = x.
Is 10^y ever going to be -5?
It is impossible to get 10^y <= 0: as y goes to -infinity, 10^y approaches 0, but it never reaches or passes it.
Is it possible to plot log(x), where x is negative?
One simple solution to your problem, however, is to take the absolute value of df. By doing this, negative numbers become positive. The only downside is that after you've transformed the data this way, you will need to undo the transformation: if a number was negative (and became positive through abs(df)), you must multiply it by -1 again afterwards.
You may need to define your own absolute value function that records any values it needs to make positive:
changeList = []
def absRecordChanges(value):
if value < 0 :
value = value * -1
changeList.append(value)
return value
There are other ways to solve the problem, but they are all centred around transforming your data to meet the conditions of a log transformation (x > 0), and recording which data you changed so you can change it back afterwards (before you plot it).
EDIT:
While fiddling around in Desmos, I was able to plot log(x) for any nonzero x. I used a piecewise function to do this: {x<0: -log(abs(x)), log(x)}.
from math import log10

def piecewiseLog(x):
    # base-10 log, matching the 10^y discussion above
    if x < 0:
        return -log10(abs(x))
    else:
        return log10(x)
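A tiny usage sketch with made-up values (the transformed numbers then go on a linear axis rather than a log one):

df_values = [-1e-15, -3e-14, -2e-13]                  # made-up negative frequency derivatives
df_transformed = [piecewiseLog(v) for v in df_values]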
As I'm not familiar with MATLAB syntax, I'll just point to this link, which has an alternative solution: http://www.mathworks.com/matlabcentral/answers/31566-display-negative-values-on-logarithmic-graph
I have experimental data of the form (X,Y) and a theoretical model of the form (x(t;*params),y(t;*params)) where t is a physical (but unobservable) variable, and *params are the parameters that I want to determine. t is a continuous variable, and there is a 1:1 relationship between x and t and between y and t in the model.
In a perfect world, I would know the value of T (the real-world value of the parameter) and would be able to do an extremely basic least-squares fit to find the values of *params. (Note that I am not trying to "connect" the values of x and y in my plot, like in 31243002 or 31464345.) I cannot guarantee that in my real data, the latent value T is monotonic, as my data is collected across multiple cycles.
I'm not very experienced with doing curve fitting manually, and I haven't found a basic scipy function that handles this directly, so I've had to use extremely crude methods. My basic approach involves:
Choose some value of *params and apply it to the model
Take an array of t values and put it into the model to create an array of model(*params) = (x(*params),y(*params))
Interpolate X (the data values) into model to get Y_predicted
Run a least-squares (or other) comparison between Y and Y_predicted
Do it again for a new set of *params
Eventually, choose the best values for *params
There are several obvious problems with this approach.
1) I'm not experienced enough with coding to develop a very good "do it again" strategy other than "try everything in the solution space", or maybe "try everything on a coarse grid" and then "try everything again on a slightly finer grid in the hotspots of the coarse grid". I tried MCMC methods, but I never found any optimum values, largely because of problem 2.
2) Steps 2-4 are super inefficient in their own right.
I've tried something like the following (it resembles pseudo-code; the actual functions are made up). There are many minor quibbles that could be made about using broadcasting on A and B, but those are less significant than the problem of needing to interpolate for every single step.
People I know have recommended using some sort of Expectation Maximization algorithm, but I don't know enough about that to code one up from scratch. I'm really hoping there's some awesome scipy (or otherwise open-source) algorithm I haven't been able to find that covers my whole problem, but at this point I am not hopeful.
import numpy as np
from scipy import interpolate

X_data = ...   # experimental x values (not shown)
Y_data = ...   # experimental y values (not shown)

def x(t, A, B):
    return A**t + B**t

def y(t, A, B):
    return A*t + B

def interp(A, B):
    # build the model curve on a grid of t and interpolate y as a function of x
    ts = np.arange(-10, 10, 0.1)
    xs = x(ts, A, B)
    ys = y(ts, A, B)
    f = interpolate.interp1d(xs, ys)
    return f

N = 101
lsqs = np.zeros((N**2, 3))          # columns: A, B, sum of squares
count = 0
for i in range(0, N):
    A = 0.1*i                       # checks A between 0 and 10
    for j in range(0, N):
        B = 10 + 0.1*j              # checks B between 10 and 20
        f = interp(A, B)
        y_fit = f(X_data)           # predicted Y at the measured X values
        squares = np.sum((y_fit - Y_data)**2)
        lsqs[count] = (A, B, squares)   # store the values for comparison later
        count += 1                      # move to the next cell

i = np.argmin(lsqs[:, 2])
A_optimal = lsqs[i, 0]
B_optimal = lsqs[i, 1]
If I understand the question correctly, the params are constants which are the same in every sample, but t varies from sample to sample. So, for example, maybe you have a whole bunch of points which you believe have been sampled from a circle
x = a+r cos(t)
y = b+r sin(t)
at different values of t.
In this case, what I would do is eliminate the variable t to get a relation between x and y; here that is (x-a)^2 + (y-b)^2 = r^2. If your data fit the model perfectly, you would have (x-a)^2 + (y-b)^2 = r^2 at each of your data points. With some error, you can still find (a, b, r) to minimize
sum_i ((x_i-a)^2 + (y_i-b)^2 - r^2)^2.
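For instance, a minimal sketch of that least-squares step for the circle case, using scipy.optimize.least_squares on made-up noisy data (a, b, r and the noise level are arbitrary):

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
t_true = rng.uniform(0, 2 * np.pi, 100)
X = 1.0 + 3.0 * np.cos(t_true) + 0.05 * rng.standard_normal(100)   # a = 1, r = 3
Y = -2.0 + 3.0 * np.sin(t_true) + 0.05 * rng.standard_normal(100)  # b = -2

def residuals(p):
    a, b, r = p
    # residual of the t-free relation (x-a)^2 + (y-b)^2 - r^2 at each data point
    return (X - a) ** 2 + (Y - b) ** 2 - r ** 2

fit = least_squares(residuals, x0=[0.0, 0.0, 1.0])
a_hat, b_hat, r_hat = fit.x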
Mathematica's Eliminate command can automate the procedure of eliminating t in some cases.
PS: You might do better at stats.stackexchange, math.stackexchange, or mathoverflow.net. I know the last one has a scary reputation, but we don't bite, really!
I've just started to try out a nice bootstrapping package available through scikits:
https://github.com/cgevans/scikits-bootstrap
but I've encountered a problem when trying to estimate confidence intervals for the correlation coefficient from linear regression: the confidence interval returned doesn't even contain the original statistic.
Here is the code:
import numpy as np
from scipy import stats
import bootstrap as boot
np.random.seed(0)
x = np.arange(10)
y = 10 + 1.5*x + 2*np.random.randn(10)
r0 = stats.linregress(x, y)[2]
def my_function(y):
return stats.linregress(x, y)[2]
ci = boot.ci(y, statfunction=my_function, alpha=0.05, n_samples=1000, method='pi')
This yields a result of ci = [-0.605, 0.644], but the original statistic is r0=0.894.
I've tried this in R and it seems to work fine there: the ci straddles r0 as expected.
Please help!
Could you provide your R code? I'd be interested in knowing how this is dealt with in R.
The problem here is that you're only passing y to boot.ci, but every time it runs my_function, it uses the entire, original x (note the lack of x input to my_function). Bootstrapping applies the statistic function to resampled data, so if you're applying your statistic function using the original x and a sample of y, you're going to have a nonsensical result. This is why the BCA method doesn't work at all, actually: it can't apply your statistic function to jackknife samples, which don't have the same number of elements.
Samples are taken along axis 0 (rows), so if you want to pass multiple 1D arrays to your statistic function, you can combine them as columns: xy = np.vstack((x, y)).T would work, and then use a statfunction that takes data from those columns:
def my_function(xysample):
return stats.linregress(xysample[:,0], xysample[:,1])[2]
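Putting that together, a short sketch (same imports and data as in the question):

xy = np.vstack((x, y)).T       # each row is an (x_i, y_i) pair, so pairs are resampled together
ci = boot.ci(xy, statfunction=my_function, alpha=0.05, n_samples=1000, method='bca')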
Alternatively, if you wanted to avoid messing with your data at all, you could define a function that operates on indexes, and then just pass indexes to boot.ci:
def my_function2(i):
return stats.linregress(x[i], y[i])[2]
boot.ci(np.arange(len(x)), statfunction=my_function2, alpha=0.05, n_samples=1000, method='pi')
Note that in either of these cases, BCA works, so you may as well use method='bca' unless you really do want to use percentage intervals; BCA is pretty much always better.
I do realize that both of these methods are less than ideal. Honestly, I've never had a need to pass multiple arrays like this to my statfunction, and the majority of people are likely using mean as their statfunction. I think the best idea here may be to allow lists of arrays that are equally sized along axis 0 to be passed, e.g., boot.ci([x, y], ...), and then sample all of those at the same time and pass them to the statfunction as separate arguments. In that case, you could just have my_function(x, y). I'll see if I can do this, but if you can show me your R code, that would be great, as I'd like to see if there is a better way of dealing with this.
Update:
In the most recent version of scikits.bootstrap (v0.3.1), a tuple of arrays can be provided, and samples from them will be passed as separate arguments to statfunction. Additionally, statfunction can provide array output, and confidence intervals will be calculated for each point in the output. Thus, this is now very easy to do. The following will give confidence intervals for every output of linregress:
cis = boot.ci( (x,y), statfunction=stats.linregress )
cis[:,2] in this case will be the desired confidence interval.