Given percentiles, find the distribution function in Python

From https://stackoverflow.com/a/30460089/2202107, we can generate CDF of a normal distribution:
import numpy as np
import matplotlib.pyplot as plt
N = 100
Z = np.random.normal(size = N)
# method 1
H,X1 = np.histogram( Z, bins = 10, normed = True )
dx = X1[1] - X1[0]
F1 = np.cumsum(H)*dx
#method 2
X2 = np.sort(Z)
F2 = np.array(range(N))/float(N)
# plt.plot(X1[1:], F1)
plt.plot(X2, F2)
plt.show()
Question: How do we generate the "original" normal distribution, given only x (eg X2) and y (eg F2) coordinates?

My first thought was plt.plot(x, np.gradient(y)), but the gradient of y was all zero (the data points are evenly spaced in y, not in x). This kind of data often comes up in percentile calculations. The key is to get the data evenly spaced in x rather than in y, using interpolation:
x=X2
y=F2
num_points=10
xinterp = np.linspace(-2,2,num_points)
yinterp = np.interp(xinterp, x, y)
# normalize so that the sum of all bars equals 1.0
tot_val=1.0
normalization_factor = tot_val/np.trapz(np.ones(len(xinterp)),yinterp)
plt.bar(xinterp, normalization_factor * np.gradient(yinterp), width=0.2)
plt.show()
output looks good to me:
I put my approach here for examination. Let me know if my logic is flawed.
One issue is: when num_points is large, the plot looks bad, but that is an issue of discretization; I'm not sure how to avoid it.
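One way to reduce that noise might be to fit a monotone (PCHIP) interpolant to the empirical CDF and differentiate it, instead of taking finite differences on a coarse grid. A minimal sketch, assuming SciPy is available and reusing x and y from above:
from scipy.interpolate import PchipInterpolator
# x must be strictly increasing, which holds for sorted samples of a continuous distribution
cdf_fit = PchipInterpolator(x, y)
xs = np.linspace(x.min(), x.max(), 200)
pdf_est = cdf_fit.derivative()(xs)  # derivative of the fitted CDF approximates the PDF
plt.plot(xs, pdf_est)
plt.show()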
Related posts:
I failed to understand why the answer was so complicated in https://stats.stackexchange.com/a/6065/131632
I also didn't understand why my approach was different from the one in Generate distribution given percentile ranks.

Related

Is there a way to plot Nullclines of a nonlinear system of ODEs

So I am trying to plot the nullclines of a system of ODEs, but I can't seem to plot them the right way. When I plot them, I manage to plot them against time (t vs x and t vs y), but not x vs y. I'm not really sure how to explain it, and I think it would be better to just show it. I am trying to replicate this. The equations and parameters are given, but they were written for a program called XPP (I'll post these at the bottom), and there are some parameters whose meaning I don't understand.
My entire code is:
import numpy as np
from scipy import integrate
import matplotlib.pyplot as plt
# define system in terms of a Numpy array
def Sys(X, t=0):
    # here X[0] = x and X[1] = y
    # protein concentration is represented by y, and mRNA concentration by x
    return np.array([ (k1*S*Kd**p)/(Kd**p + X[1]**p) - kdx*X[0], ksy*X[0] - (k2*ET*X[1])/(Km + X[1])])
#variables
k1=.1
S=1
Kd=1
kdx=.1
p=2
ksy=1
k2=1
ET=1
Km=1
# generate 1000 linearly spaced numbers for x-axes
t = np.linspace(0, 50,100)
# initial values
Sys0 = np.array([1, 0])
#Solves the ODE
X, infodict = integrate.odeint(Sys, Sys0, t, full_output = 1, mxstep = 50000)
#assigns appropriate equations to x and y
x,y = X.T
# plots the graph
fig = plt.figure(figsize=(15,5))
fig.subplots_adjust(wspace = 0.5, hspace = 0.3)
ax1 = fig.add_subplot(1,2,1)
ax1.plot(x, color="blue")
ax1.plot(y, color = 'red')
ax1.set_xlabel("Protein concentration")
ax1.set_ylabel("mRNA concentration")
ax1.set_title("Phase space")
ax1.grid()
The given equations and parameters are:
model for a simple negative feedback loop
protein (y) inhibits the synthesis of its mRNA (x)
dx/dt = k1*S*Kd^p/(Kd^p + y^p) - kdx*x
dy/dt = ksy*x - k2*ET*y/(Km + y)
p k1=0.1, S=1, Kd=1, kdx=0.1, p=2
p ksy=1, k2=1, ET=1, Km=1
# XP=y, YP=x, TOTAL=100, METH=stiff, XLO=0, XHI=4, YLO=0, YHI=1.05 (I don't exactly understand what is going on here)
Again, this uses a program called XPP or WINPP.
Any help with this would be appreciated. The original paper I am trying to replicate is: Design principles of biochemical oscillators by Bela Novak and John J. Tyson.
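For reference, a minimal sketch of one common way to draw nullclines with plt.contour, evaluating the right-hand sides on a grid in the phase plane (axis ranges and orientation taken from the XP/YP/XLO/XHI/YLO/YHI settings above; everything else is my own choice, not from the original post):
import numpy as np
import matplotlib.pyplot as plt
# parameters from the question
k1, S, Kd, kdx, p = 0.1, 1, 1, 0.1, 2
ksy, k2, ET, Km = 1, 1, 1, 1
# grid in the phase plane: x = mRNA concentration, y = protein concentration
xg = np.linspace(0, 1.05, 400)   # mRNA range (YLO..YHI in the XPP settings)
yg = np.linspace(0, 4, 400)      # protein range (XLO..XHI in the XPP settings)
X, Y = np.meshgrid(xg, yg)
dX = (k1*S*Kd**p)/(Kd**p + Y**p) - kdx*X   # dx/dt on the grid
dY = ksy*X - (k2*ET*Y)/(Km + Y)            # dy/dt on the grid
# nullclines are the zero-level contours; plot protein on the horizontal axis (XP=y, YP=x)
plt.contour(Y, X, dX, levels=[0], colors='blue')
plt.contour(Y, X, dY, levels=[0], colors='red')
plt.xlabel('protein concentration (y)')
plt.ylabel('mRNA concentration (x)')
plt.title('Nullclines')
plt.show()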

Scaling x-axis after IFFT-FFT

See the edit below for details.
I have a dataset on which I need to perform an IFFT, cut out the valuable part (by multiplying with a Gaussian curve), then FFT back.
The data starts in the angular-frequency domain, so an IFFT takes it to the time domain. FFT-ing back should then lead to angular frequency again, but I can't seem to find a way to get back the original domain. Of course it's easy for the y-values:
yf = np.fft.ifft(y)
# cut the valuable part here...
np.fft.fft(yf)
For the x-value transforms I'm using np.fft.fftfreq the following way:
# x is in ang. frequency domain, that's the reason for the 2*np.pi division
t = np.fft.fftfreq(len(x), d=(x[1]-x[0])/(2*np.pi))
However doing
x = np.fft.fftfreq(len(t), d=2*np.pi*(t[1]-t[0]))
does not give me back the original x values at all. Is there something I'm misunderstanding?
The question can be asked more generally, for example:
import numpy as np
x = np.arange(100)
xx = np.fft.fftfreq(len(x), d = x[1]-x[0])
# how to get back the original x from xx? Is it even possible?
I've tried using a temporary variable to store the original x values, but that's not too elegant. I'm looking for some kind of inverse of fftfreq, and in general the best possible solution to this problem.
Thank you.
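For what it's worth, fftfreq only encodes the number of samples and their spacing, so the absolute offset of the original axis is lost; the spacing itself can be recovered, though. A minimal sketch, assuming the original x was evenly spaced and its first value x[0] is kept separately:
import numpy as np
x = np.arange(100)
xx = np.fft.fftfreq(len(x), d=x[1]-x[0])
# fftfreq(n, d) returns values spaced by 1/(n*d), so d can be recovered from xx:
n = len(xx)
d = 1.0 / (n * (xx[1] - xx[0]))
x_rec = x[0] + np.arange(n) * d   # needs the stored offset x[0]
print(np.allclose(x_rec, x))      # True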
EDIT:
I will provide the code at the end.
I have a dataset with angular frequency on the x axis and intensity on the y axis. I want to perform an IFFT to change to the time domain. Unfortunately the x values are not evenly spaced, so a (linear) interpolation is needed before the IFFT. In the time domain the transform then looks like this:
The next step is to cut one of the symmetrical spikes with a Gaussian curve, then FFT back to the angular-frequency domain (the same one we started in). My problem is that when I transform the x-axis for the IFFT (which I think is correct), I can't get back into the original angular-frequency domain. Here is the code, which includes the generator for the dataset too.
import numpy as np
import matplotlib.pyplot as plt
import scipy
from scipy.interpolate import interp1d
C_LIGHT = 299.792
# for easier case, this is zero, so it can be ignored.
def _disp(x, GD=0, GDD=0, TOD=0, FOD=0, QOD=0):
    return x*GD+(GDD/2)*x**2+(TOD/6)*x**3+(FOD/24)*x**4+(QOD/120)*x**5

# the generator to make sample datasets
def generator(start, stop, center, delay, GD=0, GDD=0, TOD=0, FOD=0, QOD=0, resolution=0.1, pulse_duration=15, chirp=0):
    window = (np.sqrt(1+chirp**2)*8*np.log(2))/(pulse_duration**2)
    lamend = (2*np.pi*C_LIGHT)/start
    lamstart = (2*np.pi*C_LIGHT)/stop
    lam = np.arange(lamstart, lamend+resolution, resolution)
    omega = (2*np.pi*C_LIGHT)/lam
    relom = omega-center
    i_r = np.exp(-(relom)**2/(window))
    i_s = np.exp(-(relom)**2/(window))
    i = i_r + i_s + 2*np.sqrt(i_r*i_s)*np.cos(_disp(relom, GD=GD, GDD=GDD, TOD=TOD, FOD=FOD, QOD=QOD)+delay*omega)
    # since the _disp polynomial is set to be zero, it's just cos(delay*omega)
    return omega, i

def interpol(x, y):
    ''' Simple linear interpolation '''
    xs = np.linspace(x[0], x[-1], len(x))
    intp = interp1d(x, y, kind='linear', fill_value='extrapolate')
    ys = intp(xs)
    return xs, ys

def ifft_method(initSpectrumX, initSpectrumY, interpolate=True):
    if len(initSpectrumY) > 0 and len(initSpectrumX) > 0:
        Ydata = initSpectrumY
        Xdata = initSpectrumX
    else:
        raise ValueError
    N = len(Xdata)
    if interpolate:
        Xdata, Ydata = interpol(Xdata, Ydata)
        # the (2*np.pi) division is because we have angular frequency, not frequency
        xf = np.fft.fftfreq(N, d=(Xdata[1]-Xdata[0])/(2*np.pi)) * N * Xdata[-1]/(N-1)
        yf = np.fft.ifft(Ydata)
    else:
        pass  # some irrelevant code there
    return xf, yf

def fft_method(initSpectrumX, initSpectrumY):
    if len(initSpectrumY) > 0 and len(initSpectrumX) > 0:
        Ydata = initSpectrumY
        Xdata = initSpectrumX
    else:
        raise ValueError
    yf = np.fft.fft(Ydata)
    xf = np.fft.fftfreq(len(Xdata), d=(Xdata[1]-Xdata[0])*2*np.pi)
    # the problem is there, where I transform the x values
    xf = np.fft.ifftshift(xf)
    return xf, yf
# the generated data
x, y = generator(1, 3, 2, delay = 1500, resolution = 0.1)
# plt.plot(x,y)
xx, yy = ifft_method(x,y)
#if the x values are correctly scaled, the two symmetrical spikes should appear exactly at delay value
# plt.plot(xx, np.abs(yy))
#do the cutting there, which is also irrelevant now
# the problem is there, in fft_method. The x values are not the same as before transforms.
xxx, yyy = fft_method(xx, yy)
plt.plot(xxx, np.abs(yyy))
#and it should look like this:
#xs = np.linspace(x[0], x[-1], len(x))
#plt.plot(xs, np.abs(yyy))
plt.grid()
plt.show()
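One thing worth noting (a sketch under the assumption that nothing else changes the array length or ordering): since ifft followed by fft is an identity up to round-off, the angular-frequency axis after transforming back is simply the interpolated axis you started from, so it can be reused directly instead of being reconstructed with another fftfreq call. With xs and ys as stand-ins for the interpolated spectrum:
import numpy as np
xs = np.linspace(1, 3, 2048)                       # evenly spaced angular-frequency axis
ys = np.exp(-(xs - 2)**2 / 0.1) * (1 + np.cos(1500*xs))
yt = np.fft.ifft(ys)                               # to the time domain
# ... the cutting with a gaussian window would happen here ...
y_back = np.fft.fft(yt)                            # back to angular frequency
# the x axis for y_back is just xs again; no fftfreq call is needed on the way back
print(np.allclose(y_back, ys))                     # True if nothing was cut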

Generating 3D Gaussian distribution in Python

I want to generate a Gaussian distribution in Python with the x and y dimensions denoting position and the z dimension denoting the magnitude of a certain quantity.
The distribution has a maximum value of 2e6 and a standard deviation sigma=0.025.
In MATLAB I can do this with:
x1 = linspace(-1,1,30);
x2 = linspace(-1,1,30);
mu = [0,0];
Sigma = [.025,.025];
[X1,X2] = meshgrid(x1,x2);
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
F = 314159.153*reshape(F,length(x2),length(x1));
surf(x1,x2,F);
In Python, what I have so far is:
x = np.linspace(-1,1,30)
y = np.linspace(-1,1,30)
mu = (np.median(x),np.median(y))
sigma = (.025,.025)
There is a NumPy function, numpy.random.multivariate_normal, which can supposedly do the same as MATLAB's mvnpdf, but I am struggling to understand the documentation, especially how to obtain the covariance matrix needed by numpy.random.multivariate_normal.
As of scipy 0.14, you can use scipy.stats.multivariate_normal.pdf()
import numpy as np
from scipy.stats import multivariate_normal
x, y = np.mgrid[-1.0:1.0:30j, -1.0:1.0:30j]
# Need an (N, 2) array of (x, y) pairs.
xy = np.column_stack([x.flat, y.flat])
mu = np.array([0.0, 0.0])
sigma = np.array([.025, .025])
covariance = np.diag(sigma**2)
z = multivariate_normal.pdf(xy, mean=mu, cov=covariance)
# Reshape back to a (30, 30) grid.
z = z.reshape(x.shape)
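To match the question's peak value of 2e6 and look at the result as a surface, one could rescale the PDF and plot it; a sketch, assuming the 2e6 in the question is meant as the peak height:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib
z_scaled = z * 2e6 / z.max()             # rescale so the maximum equals 2e6
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x, y, z_scaled, cmap='viridis')
plt.show()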
I am working on a scikit called scikit-guess that contains some fast estimation routines for non-linear fits. It has a function skg.ngauss.model (also accessible as skg.ngauss_fit.model or skg.ngauss.ngauss_fit.model) which does exactly what you want. The nice thing is that it's not a PDF, so you can set the amplitude out of the box:
import numpy as np
import skg.ngauss
a = 2e6
mu = 0, 0
sigma = 0.025, 0.025
x = y = np.linspace(-1, 1, 31)
cov = np.diag(sigma)**2
X = np.meshgrid(x, y)
data = skg.ngauss.model(X, a, mu, cov, axis=0)
You need to tell it axis=0 because it automatically stacks your arrays for you. To avoid passing in that argument, you could write
X = np.stack(np.meshgrid(x, y), axis=-1)
You can plot the result:
from matplotlib import pyplot as plt
plt.imshow(data)
plt.show()
This is not a very exciting distribution because the spread is so small that you end up with a value of ~2e-5 just one pixel away. You may want to up your sampling space to get any sort of meaningful resolution.
Note: At time of writing, the fitting function (ngauss_fit) is still buggy, but the model has been tested successfully, just not in the scikit.
Disclaimer: In case it wasn't obvious from the above, I am the author of scikit-guess.

Discretize path with numpy array and equal distance between points

Let's say I have a path in the 2D plane given by a parametrization, for example the Archimedean spiral:
x(φ) = a*φ*cos(φ), y(φ) = a*φ*sin(φ)
I'm looking for a way to discretize this with a numpy array.
The problem is that if I use
a = 1
phi = np.arange(0, 10*np.pi, 0.1)
x = a*phi*np.cos(phi)
y = a*phi*np.sin(phi)
plt.plot(x,y, "ro")
I get a nice curve, but the points are not equally spaced; for growing φ the distance between two consecutive points gets larger.
I'm looking for a nice and, if possible, fast way to do this.
It might be possible to get the exact analytical formula for your simple spiral, but I am not in the mood to do that and this might not be possible in a more general case. Instead, here is a numerical solution:
import matplotlib.pyplot as plt
import numpy as np
a = 1
phi = np.arange(0, 10*np.pi, 0.1)
x = a*phi*np.cos(phi)
y = a*phi*np.sin(phi)
dr = (np.diff(x)**2 + np.diff(y)**2)**.5 # segment lengths
r = np.zeros_like(x)
r[1:] = np.cumsum(dr) # integrate path
r_int = np.linspace(0, r.max(), 200) # regular spaced path
x_int = np.interp(r_int, r, x) # interpolate
y_int = np.interp(r_int, r, y)
plt.subplot(1,2,1)
plt.plot(x, y, 'o-')
plt.title('Original')
plt.axis([-32,32,-32,32])
plt.subplot(1,2,2)
plt.plot(x_int, y_int, 'o-')
plt.title('Interpolated')
plt.axis([-32,32,-32,32])
plt.show()
It calculates the lengths of all the individual segments, integrates the total path with cumsum and finally interpolates to get a regularly spaced path. You might have to play with your step size in phi; if it is too large, you will see that the spiral is not a smooth curve but is built from straight line segments. Result:
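A quick sanity check one can run on x_int and y_int: the distances between consecutive interpolated points should be nearly constant (exactly equal arc length along the polyline, and almost equal Euclidean distance for a fine phi grid):
d = np.hypot(np.diff(x_int), np.diff(y_int))  # Euclidean distances between consecutive points
print(d.min(), d.max(), d.mean())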

Fitting a Weibull distribution using Scipy

I am trying to recreate maximum likelihood distribution fitting. I can already do this in Matlab and R, but now I want to use scipy. In particular, I would like to estimate the Weibull distribution parameters for my data set.
I have tried this:
import scipy.stats as s
import numpy as np
import matplotlib.pyplot as plt
def weib(x, n, a):
    return (a / n) * (x / n)**(a - 1) * np.exp(-(x / n)**a)
data = np.loadtxt("stack_data.csv")
(loc, scale) = s.exponweib.fit_loc_scale(data, 1, 1)
print loc, scale
x = np.linspace(data.min(), data.max(), 1000)
plt.plot(x, weib(x, loc, scale))
plt.hist(data, data.max(), density=True)
plt.show()
And get this:
(2.5827280639441961, 3.4955032285727947)
And a distribution that looks like this:
I have been using the exponweib after reading this http://www.johndcook.com/distributions_scipy.html. I have also tried the other Weibull functions in scipy (just in case!).
In Matlab (using the Distribution Fitting Tool - see screenshot) and in R (using both the MASS library function fitdistr and the GAMLSS package) I get a (loc) and b (scale) parameters more like 1.58463497 5.93030013. I believe all three methods use the maximum likelihood method for distribution fitting.
I have posted my data here if you would like to have a go! And for completeness I am using Python 2.7.5, Scipy 0.12.0, R 2.15.2 and Matlab 2012b.
Why am I getting a different result!?
My guess is that you want to estimate the shape parameter and the scale of the Weibull distribution while keeping the location fixed. Fixing loc assumes that the values of your data and of the distribution are positive with lower bound at zero.
floc=0 keeps the location fixed at zero, f0=1 keeps the first shape parameter of the exponential weibull fixed at one.
>>> stats.exponweib.fit(data, floc=0, f0=1)
[1, 1.8553346917584836, 0, 6.8820748596850905]
>>> stats.weibull_min.fit(data, floc=0)
[1.8553346917584836, 0, 6.8820748596850549]
The fit compared to the histogram looks ok, but not very good. The parameter estimates are a bit higher than the ones you mention from R and Matlab.
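To reproduce that comparison yourself, a short sketch (reusing the data array loaded in the question):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
data = np.loadtxt("stack_data.csv")
c, loc, scale = stats.weibull_min.fit(data, floc=0)   # same constrained fit as above
xs = np.linspace(data.min(), data.max(), 200)
plt.hist(data, bins=30, density=True, alpha=0.5)
plt.plot(xs, stats.weibull_min.pdf(xs, c, loc, scale), lw=2)
plt.show()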
Update
The closest I can get to the plot that is now available is with an unrestricted fit, but using starting values. The plot is still less peaked. Note that values passed to fit without an f in front are used as starting values.
>>> from scipy import stats
>>> import matplotlib.pyplot as plt
>>> plt.plot(data, stats.exponweib.pdf(data, *stats.exponweib.fit(data, 1, 1, scale=2, loc=0)))
>>> _ = plt.hist(data, bins=np.linspace(0, 16, 33), normed=True, alpha=0.5);
>>> plt.show()
It is easy to verify which result is the true MLE, just need a simple function to calculate log likelihood:
>>> def wb2LL(p, x):  # log-likelihood
...     return sum(log(stats.weibull_min.pdf(x, p[1], 0., p[0])))
>>> adata=loadtxt('/home/user/stack_data.csv')
>>> wb2LL(array([6.8820748596850905, 1.8553346917584836]), adata)
-8290.1227946678173
>>> wb2LL(array([5.93030013, 1.57463497]), adata)
-8410.3327470347667
The result from the fit method of exponweib and from R's fitdistr (@Warren) is better and has higher log likelihood. It is more likely to be the true MLE. It is not surprising that the result from GAMLSS is different: it is a completely different statistical model, a Generalized Additive Model.
Still not convinced? We can draw a 2D confidence limit plot around the MLE (see Meeker and Escobar's book for details).
Again this verifies that array([6.8820748596850905, 1.8553346917584836]) is the right answer, as the log likelihood there is higher than at any other point in the parameter space. Note:
>>> log(array([6.8820748596850905, 1.8553346917584836]))
array([ 1.92892018, 0.61806511])
BTW1, the MLE fit may not appear to fit the distribution histogram tightly. An easy way to think about MLE is that it is the parameter estimate that is most probable given the observed data. It doesn't need to fit the histogram well visually; that would be something like minimizing the mean squared error.
BTW2, your data appear to be leptokurtic and left-skewed, which means the Weibull distribution may not fit your data well. Try, e.g., Gompertz-Logistic, which improves the log likelihood by roughly another 100.
Cheers!
I know it's an old post, but I just faced a similar problem and this thread helped me solve it. Thought my solution might be helpful for others like me:
# Fit Weibull function, some explanation below
params = stats.exponweib.fit(data, floc=0, f0=1)
shape = params[1]
scale = params[3]
print 'shape:',shape
print 'scale:',scale
#### Plotting
# Histogram first
values,bins,hist = plt.hist(data,bins=51,range=(0,25),normed=True)
center = (bins[:-1] + bins[1:]) / 2.
# Using all params and the stats function
plt.plot(center,stats.exponweib.pdf(center,*params),lw=4,label='scipy')
# Using my own Weibull function as a check
def weibull(u, shape, scale):
    '''Weibull distribution for wind speed u with shape parameter k and scale parameter A'''
    return (shape / scale) * (u / scale)**(shape-1) * np.exp(-(u/scale)**shape)
plt.plot(center,weibull(center,shape,scale),label='Wind analysis',lw=2)
plt.legend()
Some extra info that helped me understand:
The Scipy Weibull function can take four input parameters: (a, c), loc and scale.
You want to fix the loc and the first shape parameter (a); this is done with floc=0, f0=1. Fitting will then give you the parameters c and scale, where c corresponds to the shape parameter of the two-parameter Weibull distribution (often used in wind data analysis) and scale corresponds to its scale factor.
From docs:
exponweib.pdf(x, a, c) =
a * c * (1-exp(-x**c))**(a-1) * exp(-x**c)*x**(c-1)
If a is 1, then
exponweib.pdf(x, a, c) =
c * (1-exp(-x**c))**(0) * exp(-x**c)*x**(c-1)
= c * (1) * exp(-x**c)*x**(c-1)
= c * x **(c-1) * exp(-x**c)
From this, the relation to the 'wind analysis' Weibull function should be clearer.
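A quick numerical check of that identity (my choice of c and scale; any positive values should do):
import numpy as np
from scipy import stats
xchk = np.linspace(0.1, 20, 50)
c, A = 1.8, 6.0
wind = (c / A) * (xchk / A)**(c - 1) * np.exp(-(xchk / A)**c)
print(np.allclose(stats.exponweib.pdf(xchk, 1, c, scale=A), wind))  # True
print(np.allclose(stats.weibull_min.pdf(xchk, c, scale=A), wind))   # True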
I was curious about your question and, although this is not an answer, here is a comparison of the Matlab result with your result and with the result using leastsq, which showed the best correlation with the given data:
The code is as follows:
import scipy.stats as s
import numpy as np
import matplotlib.pyplot as plt
import numpy.random as mtrand
from scipy.integrate import quad
from scipy.optimize import leastsq
## Weibull distribution (same weib as in the question)
def weib(x, n, a):
    return (a / n) * (x / n)**(a-1) * np.exp(-(x/n)**a)

def residuals(p, x, y):
    integral = quad(weib, 0, 16, args=(p[0], p[1]))[0]
    penalization = abs(1. - integral)*100000
    return y - weib(x, p[0], p[1]) + penalization
#
data = np.loadtxt("stack_data.csv")
x = np.linspace(data.min(), data.max(), 100)
n, bins, patches = plt.hist(data,bins=x, normed=True)
binsm = (bins[1:]+bins[:-1])/2
popt, pcov = leastsq(func=residuals, x0=(1.,1.), args=(binsm,n))
loc, scale = 1.58463497, 5.93030013
plt.plot(binsm,n)
plt.plot(x, weib(x, loc, scale),
label='weib matlab, loc=%1.3f, scale=%1.3f' % (loc, scale), lw=4.)
loc, scale = s.exponweib.fit_loc_scale(data, 1, 1)
plt.plot(x, weib(x, loc, scale),
label='weib stack, loc=%1.3f, scale=%1.3f' % (loc, scale), lw=4.)
plt.plot(x, weib(x,*popt),
label='weib leastsq, loc=%1.3f, scale=%1.3f' % tuple(popt), lw=4.)
plt.legend(loc='upper right')
plt.show()
I had the same problem, but found that setting loc=0 in exponweib.fit primed the pump for the optimization. That was all that was needed from @user333700's answer. I couldn't load your data -- your data link points to an image, not data. So I ran a test on my data instead:
import scipy.stats as ss
import matplotlib.pyplot as plt
import numpy as np
N=30
counts, bins = np.histogram(x, bins=N)
bin_width = bins[1]-bins[0]
total_count = float(sum(counts))
f, ax = plt.subplots(1, 1)
f.suptitle(query_uri)
ax.bar(bins[:-1]+bin_width/2., counts, align='center', width=.85*bin_width)
ax.grid('on')
def fit_pdf(x, name='lognorm', color='r'):
    dist = getattr(ss, name)  # params = shape, loc, scale
    # dist = ss.gamma  # 3 params
    params = dist.fit(x, loc=0)  # 1-day lag minimum for shipping
    y = dist.pdf(bins, *params)*total_count*bin_width
    sqerror_sum = np.log(sum(ci*(yi - ci)**2. for (ci, yi) in zip(counts, y)))
    ax.plot(bins, y, color, lw=3, alpha=0.6, label='%s err=%3.2f' % (name, sqerror_sum))
    return y
colors = ['r-', 'g-', 'r:', 'g:']
for name, color in zip(['exponweib', 't', 'gamma'], colors):  # 'lognorm', 'erlang', 'chi2', 'weibull_min',
    y = fit_pdf(x, name=name, color=color)
ax.legend(loc='best', frameon=False)
plt.show()
There have already been a few answers to this here and in other places, like Weibull distribution and the data in the same figure (with numpy and scipy).
It still took me a while to come up with a clean toy example, so I thought it would be useful to post it.
from scipy import stats
import matplotlib.pyplot as plt
#input for pseudo data
N = 10000
Kappa_in = 1.8
Lambda_in = 10
a_in = 1
loc_in = 0
#Generate data from given input
data = stats.exponweib.rvs(a=a_in,c=Kappa_in, loc=loc_in, scale=Lambda_in, size = N)
#The a and loc are fixed in the fit since it is standard to assume they are known
a_out, Kappa_out, loc_out, Lambda_out = stats.exponweib.fit(data, f0=a_in,floc=loc_in)
#Plot
bins = range(51)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(bins, stats.exponweib.pdf(bins, a=a_out,c=Kappa_out,loc=loc_out,scale = Lambda_out))
ax.hist(data, bins = bins , density=True, alpha=0.5)
ax.annotate("Shape: $k = %.2f$ \n Scale: $\lambda = %.2f$"%(Kappa_out,Lambda_out), xy=(0.7, 0.85), xycoords=ax.transAxes)
plt.show()
In the meantime, there is a really good package out there: reliability. Here is the documentation: reliability on readthedocs.
Your code simply becomes:
from reliability.Fitters import Fit_Weibull_2P
...
wb = Fit_Weibull_2P(failures=data)
plt.show()
Saves a lot of headaches and makes beautiful plots, too.
The order of loc and scale is mixed up in the question's code; the scale parameter should come first:
plt.plot(x, weib(x, scale, loc))
