Getting a pdf from scipy.stats in a generic way - python

I am running some goodness of fit tests using scipy.stats in Python 2.7.10.
for distrName in distrNameList:
    distr = getattr(distributions, distrName)
    param = distr.fit(sample)
    pdf = distr.pdf(???)
What do I pass into distr.pdf() to get the values of the best-fit pdf on the list of sample points of interest, called abscissas?

From the documentation, the .fit() method returns:
shape, loc, scale : tuple of floats
MLEs for any shape statistics, followed by those for location and scale.
and the .pdf() method accepts:
x : array_like
quantiles
arg1, arg2, arg3,... : array_like
The shape parameter(s) for the distribution (see docstring of the instance object for more information)
loc : array_like, optional
location parameter (default=0)
scale : array_like, optional
scale parameter (default=1)
So essentially you would do something like this:
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
# some random variates drawn from a beta distribution
rvs = stats.beta.rvs(2, 5, loc=0, scale=1, size=1000)
# estimate distribution parameters, in this case (a, b, loc, scale)
params = stats.beta.fit(rvs)
# evaluate PDF
x = np.linspace(0, 1, 1000)
pdf = stats.beta.pdf(x, *params)
# plot
fig, ax = plt.subplots(1, 1)
ax.hist(rvs, density=True)  # normed was renamed to density; ax.hold is no longer needed
ax.plot(x, pdf, '--r')

To evaluate the pdf at abscissas, you would pass abscissas as the first argument to pdf. To specify the parameters, use the * operator to unpack the param tuple and pass those values to distr.pdf:
pdf = distr.pdf(abscissas, *param)
For example,
import numpy as np
import scipy.stats as stats
distrNameList = ['beta', 'expon', 'gamma']
sample = stats.norm(0, 1).rvs(1000)
abscissas = np.linspace(0, 1, 10)
for distrName in distrNameList:
    distr = getattr(stats.distributions, distrName)
    param = distr.fit(sample)
    pdf = distr.pdf(abscissas, *param)
    print(pdf)
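One caveat: not every distribution can fit arbitrary data, and some fits raise errors or warnings when the sample falls outside the distribution's support. A defensive variant of the loop above (a sketch, reusing sample and abscissas, with hypothetical error handling) simply skips those:
for distrName in distrNameList:
    distr = getattr(stats.distributions, distrName)
    try:
        param = distr.fit(sample)
    except Exception as err:
        # some distributions cannot be fitted to this sample; skip them
        print('{0}: fit failed ({1})'.format(distrName, err))
        continue
    pdf = distr.pdf(abscissas, *param)
    print('{0}: {1}'.format(distrName, pdf))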

Related

What's the equivalent of fitdist and histfit in Python?

--- SAMPLE ---
I have a data set (sample) that contains 1,000 damage values (all very small, < 1e-6) in a 1-dimensional array (see the attached .json file). The sample seems to follow a lognormal distribution:
--- PROBLEM & WHAT I ALREADY TRIED ---
I tried the suggestions in this post Fitting empirical distribution to theoretical ones with Scipy (Python)? and this post Scipy: lognormal fitting to fit my data by a lognormal distribution. Neither of them works. :(
I always get something very large on the Y-axis, like the following:
Here is the code that I used in Python (and the data.json file can be downloaded from here):
from matplotlib import pyplot as plt
from scipy import stats as scistats
import json
with open("data.json", "r") as f:
    sample = json.load(f)  # load data: a 1000 * 1 array with many small values (< 1e-6)
fig, axis = plt.subplots()  # initiate a figure
N, nbins, patches = axis.hist(sample, bins=40)  # plot sample as a histogram
axis.ticklabel_format(style='sci', scilimits=(-3, 4), axis='x')  # make the X-axis use scientific notation
axis.set_xlabel("Value")
axis.set_ylabel("Count")
plt.show()
fig, axis = plt.subplots()
param = scistats.lognorm.fit(sample)  # fit data with a lognormal distribution
pdf_fitted = scistats.lognorm.pdf(nbins, *param[:-2], loc=param[-2], scale=param[-1])  # evaluate the fitted pdf at the bin edges
axis.plot(nbins, pdf_fitted)  # draw the fitted distribution on the same figure
plt.show()
I tried other kinds of distributions, but when I plot the result the Y-axis is always far too large and the curve doesn't fit with my histogram. Where did I go wrong?
I have also tried the suggestion in another question of mine: Use scipy lognormal distribution to fit data with small values, then show in matplotlib. But the value of the variable pdf_fitted is always too big.
--- EXPECTING RESULT ---
Basically, what I want is like this:
And here is the Matlab code that I used in the above screenshot:
fname = 'data.json';
sample = jsondecode(fileread(fname));
% fitting distribution
pd = fitdist(sample, 'lognormal')
% A combined command for plotting histogram and distribution
figure();
histfit(sample,40,"lognormal")
So if you have any idea of the equivalent of fitdist and histfit in Python/Scipy/Numpy/Matplotlib, please post it!
Thanks a lot!
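Before reaching for a separate library, note what drives the huge Y values: a pdf over data concentrated near 1e-7 takes values around 1e7 (it must integrate to 1 over a very narrow support), so it cannot share an axis with raw histogram counts. A minimal scipy/matplotlib sketch of a histfit-style plot (assuming data.json holds a flat list of small positive values, as above) fixes this by plotting the histogram with density=True:
import json
import numpy as np
from matplotlib import pyplot as plt
from scipy import stats as scistats
with open("data.json", "r") as f:
    sample = json.load(f)
fig, axis = plt.subplots()
# density=True normalizes the histogram so it lives on the same scale as the pdf
axis.hist(sample, bins=40, density=True, alpha=0.5, label="Data")
# floc=0 fixes the location parameter, which often stabilizes lognormal fits on tiny positive values
shape, loc, scale = scistats.lognorm.fit(sample, floc=0)
x = np.linspace(min(sample), max(sample), 500)
axis.plot(x, scistats.lognorm.pdf(x, shape, loc=loc, scale=scale), "r--", label="Lognormal fit")
axis.legend()
plt.show()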
Try the distfit (or fitdist) library.
https://erdogant.github.io/distfit
pip install distfit
import numpy as np
# Example data
X = np.random.normal(10, 3, 2000)
y = [3,4,5,6,10,11,12,18,20]
# From the distfit library import the class distfit
from distfit import distfit
# Initialize
dist = distfit()
# Search for the best theoretical fit to your empirical data
dist.fit_transform(X)
# Plot
dist.plot()
# summary plot
dist.plot_summary()
So in your case it would be:
dist = distfit(distr='lognorm')
dist.fit_transform(X)
Try seaborn:
import seaborn as sns, numpy as np
sns.set(); np.random.seed(0)
x = np.random.randn(100)
ax = sns.distplot(x)
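Note that distplot is deprecated in newer seaborn; assuming seaborn >= 0.11, a similar figure comes from histplot:
ax = sns.histplot(x, kde=True, stat="density")  # histogram plus KDE on the density scale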
I tried your dataset with the OpenTURNS library.
x is the list given in your json file.
import openturns as ot
from openturns.viewer import View
import matplotlib.pyplot as plt
# first format your list x as a sample of dimension 1
sample = ot.Sample(x,1)
# use the LogNormalFactory to build a Lognormal distribution according to your sample
distribution = ot.LogNormalFactory().build(sample)
# draw the pdf of the obtained distribution
graph = distribution.drawPDF()
graph.setLegends(["LogNormal"])
View(graph)
plt.show()
If you want the parameters of the distribution
print(distribution)
>>> LogNormal(muLog = -16.5263, sigmaLog = 0.636928, gamma = 3.01106e-08)
You can build the histogram the same way by calling HistogramFactory, then you can add one graph to another:
graph2 = ot.HistogramFactory().build(sample).drawPDF()
graph2.setColors(['blue'])
graph2.setLegends(["Histogram"])
graph2.add(graph)
view = View(graph2)
and set the axis bounds if you want to zoom:
axes = view.getAxes()
_ = axes[0].set_xlim(-0.6e-07, 2.8e-07)
plt.show()

Scipy.stats's fit and pdf functions

I want to know how scipy.stats uses its fit and pdf methods. According to the documentation, fit(data, a, loc=0, scale=1) estimates parameters for data, while pdf(x, a, loc=0, scale=1) computes the probability density function. But I couldn't find how fit and pdf actually work, statistically and mathematically.
I am using the sm.datasets.elnino data and the code from tmthydvnprt:
import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm  # statsmodels.api exposes sm.datasets
import matplotlib
import matplotlib.pyplot as plt
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())
y, x = np.histogram(data, bins = 50, density = True)
x = (x + np.roll(x, -1))[:-1] / 2.0
distribution = st.gennorm
params = distribution.fit(data)
arg = params[:-2]
loc = params[-2]
scale = params[-1]
pdf = distribution.pdf(x, loc = loc, scale = scale, *arg)
sse = np.sum(np.power(y - pdf, 2.0))
For this data, arg = 4.3836, loc = 23.2991, scale = 3.8499.
I want to know what arg, loc, and scale represent and how they are calculated.
Thank you.
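Briefly: fit returns maximum likelihood estimates, arg holds the distribution's shape parameter(s) (gennorm's beta here), and loc and scale shift and stretch the standardized distribution. A small sketch of that loc/scale identity, reusing the fitted values above:
import numpy as np
import scipy.stats as st
beta, loc, scale = 4.3836, 23.2991, 3.8499  # fitted values from above
x = np.linspace(15, 31, 5)
# pdf(x, beta, loc, scale) == pdf((x - loc) / scale, beta) / scale
lhs = st.gennorm.pdf(x, beta, loc=loc, scale=scale)
rhs = st.gennorm.pdf((x - loc) / scale, beta) / scale
print(np.allclose(lhs, rhs))  # True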

Fitting data to a Weibull distribution

I have a set of integer values, and I want to fit them to a Weibull distribution and get the best-fit parameters. Then I draw the histogram of the data together with the pdf of the Weibull distribution, using the best-fit parameters. This is the code I used.
from jtlHandler import *
import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
def get_pdf(latencies):
    a = np.array(latencies)
    ag = st.gaussian_kde(a)
    ak = np.linspace(np.min(a), np.max(a), len(a))
    agv = ag(ak)
    plt.plot(ak, agv)
    plt.show()
    return (ak, agv)
def fit_to_distribution(distribution, data):
    params = distribution.fit(data)
    # Return MLEs for shape (if applicable), location, and scale parameters from data.
    #
    # MLE stands for Maximum Likelihood Estimate. Starting estimates for the fit are given by input arguments; for any arguments not provided with starting estimates, self._fitstart(data) is called to generate such.
    return params
def make_distribution_pdf(dist, params, end):
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]
    # Build PDF and turn into pandas Series
    x = np.linspace(0, end, end)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)
    return pdf
latencies = getLatencyList("filename")
latencies = latencies[int(9*(len(latencies)/10)):len(latencies)]
data = pd.Series(latencies)
params = fit_to_distribution(st.weibull_max, data)
print("Parameters for the fit: "+str(params))
# Make PDF
pdf = make_distribution_pdf(st.weibull_max, params, max(latencies))
# Display
plt.figure()
ax = pdf.plot(lw=2, label='PDF', legend=True)
data.plot(kind='hist', bins=200, density=True, alpha=0.5, label='Data',
          legend=True, ax=ax)
ax.set_title('Weibull distribution')
ax.set_xlabel('Latency')
ax.set_ylabel('Frequency')
plt.savefig("image.png")
This is the resulting figure.
As can be seen, the Weibull approximation is not similar to the original distribution of the data.
How can I get the best Weibull approximation to my data?
You can fit a data set (a set of numbers) to any distribution using the following two functions.
import os
import matplotlib.pyplot as plt
import sys
import math
import numpy as np
import scipy.stats as st
from scipy.stats._continuous_distns import _distn_names
from scipy.optimize import curve_fit
def fit_to_distribution(distribution, latency_values):
    distribution = getattr(st, distribution)
    params = distribution.fit(latency_values)
    return params
def make_distribution_pdf(distribution, latency_list):
    distribution = getattr(st, distribution)
    params = distribution.fit(latency_list)
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]
    x = np.linspace(min(latency_list), max(latency_list), 10000)
    y = distribution.pdf(x, loc=loc, scale=scale, *arg)
    return x, y
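A hypothetical usage sketch of those two functions (reusing the imports above, with synthetic stand-in data since the original latency list isn't available):
latencies = list(10 * np.random.weibull(1.5, 1000))  # hypothetical stand-in data
params = fit_to_distribution('weibull_min', latencies)
print('Parameters for the fit: ' + str(params))
x, y = make_distribution_pdf('weibull_min', latencies)
plt.hist(latencies, bins=50, density=True, alpha=0.5, label='Data')
plt.plot(x, y, 'r--', label='weibull_min PDF')
plt.legend()
plt.show()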

Calculating model evidence/marginals in Python

My question pertains to Bayesian inference and how to numerically calculate the model evidence given some data, a prior distribution, and a posterior distribution.
Given conjugate priors, the Wikipedia article specifies the model evidence as:
p(Y | X, m) = ∫∫ p(Y | X, σ, β, m) p(σ, β | m) dσ dβ
where σ and β are parameters, m is the model, Y is the data and X is the prior.
Given the setup below, how do I calculate model evidence? I need something that returns one scalar number.
Below I have a minimal working example of generating some data (draws from a normal) and assuming a prior (a normal) and a likelihood function (a gaussian). Notice how both the PDF of the data and the prior integrate to (approximately) one, while the likelihood function can take values over 1.
I am mainly confused as to how to "integrate out" the parameters from the model, and thus take model complexity into consideration. I can see how this can be done analytically if you can write down the log-likelihood function, but I can't really see how this results in one scalar number.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import scipy
import seaborn as sns
sns.set(style="white", palette="muted", color_codes=True)
%matplotlib inline
mu = 0
variance = 1
sigma = np.sqrt(variance)
data = np.random.normal(mu, sigma, 100)  # second argument is the standard deviation, not the variance
x = np.linspace(-5,5,100)
density = scipy.stats.gaussian_kde(data)
data_pdf = density(x)
prior_pdf = scipy.stats.norm.pdf(x, mu, sigma)
likelihood = np.exp(-np.power(x - mu, 2.) / (2 * np.power(sigma, 2.)))
I1=scipy.integrate.trapz(data_pdf,x)
I2=scipy.integrate.trapz(prior_pdf,x)
I3=scipy.integrate.trapz(likelihood,x)
fig1 = plt.figure(figsize=(7.5,5))
ax1 = fig1.add_subplot(3,1,1)
sns.despine(right=True)
ax1.plot(x,data_pdf,'k')
ax1.legend([r'$PDF(Data)$'],loc='upper left')
ax2 = fig1.add_subplot(3,1,2)
sns.despine(right=True)
ax2.plot(x,prior_pdf,'b')
ax2.legend([r'$Prior$'],loc='upper left')
ax3 = fig1.add_subplot(3,1,3)
sns.despine(right=True)
ax3.plot(x,likelihood,'r')
ax3.legend([r'$Likelihood$'],loc='upper left')
plt.tight_layout()
print(I1,I2,I3)
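To make the "integrate out" step concrete, here is a minimal sketch under simplifying assumptions (sigma treated as known and only the mean mu marginalized on a grid; this is not the full conjugate setup). The evidence is the likelihood of the whole data set averaged over the prior, which collapses to a single scalar:
from scipy import integrate
mu_grid = np.linspace(-5, 5, 1001)          # grid over the unknown mean
prior_mu = stats.norm.pdf(mu_grid, 0, 1)    # p(mu): standard normal prior
# log-likelihood of the entire data set at each candidate mu
log_like = np.array([stats.norm.logpdf(data, m, sigma).sum() for m in mu_grid])
shift = log_like.max()                      # rescale to avoid numerical underflow
like = np.exp(log_like - shift)
# evidence = integral of p(data | mu) p(mu) dmu  -- one scalar number
evidence = integrate.trapezoid(like * prior_mu, mu_grid) * np.exp(shift)
print(evidence)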

What is the source of discrepancy in 2D interpolated spectrogram with matplotlib?

I am trying to interpolate a spectrogram obtained from matplotlib using scipy's interp2d function, but somehow fail to get the same spectrogram. The data is available here
The actual spectrogram is:
And the interpolated spectrogram is:
The code looks okay, but something is still wrong. The code used is:
from __future__ import division
from matplotlib import ticker as mtick
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt
import numpy as np
from bisect import bisect
from scipy import interpolate
from matplotlib.ticker import MaxNLocator
data = np.genfromtxt('spectrogram.dat', skip_header=2, delimiter=',')
pressure = data[:, 1] * 0.065
time = data[:, 0]
cax = plt.specgram(pressure * 100000, NFFT = 256, Fs = 50000, noverlap=4, cmap=plt.cm.gist_heat, zorder = 1)
f = interpolate.interp2d(cax[2], cax[1], cax[0], kind='cubic')
xnew = np.linspace(cax[2][0], cax[2][-1], 100)
ynew = np.linspace(cax[1][0], cax[1][-1], 100)
znew = 10 * np.log10(f(xnew, ynew))
fig = plt.figure(figsize=(6, 3.2))
ax = fig.add_subplot(111)
ax.set_title('colorMap')
plt.pcolormesh(xnew, ynew, znew, cmap=plt.cm.gist_heat)
# plt.colorbar()
plt.title('Interpolated spectrogram')
plt.colorbar(orientation='vertical')
plt.savefig('interp_spectrogram.pdf')
How to interpolate a spectrogram correctly with Python?
The key to your solution is in this warning, which you may or may not have seen:
RuntimeWarning: invalid value encountered in log10
znew = 10 * np.log10(f(xnew, ynew))
If your data is actually a power whose log you'd like to view explicitly as decibel power, take the log first, before fitting to the spline:
spectrum, freqs, t, im = cax
dB = 10*np.log10(spectrum)
#f = interpolate.interp2d(t, freqs, dB, kind='cubic') # docs for this recommend next line
f = interpolate.RectBivariateSpline(t, freqs, dB.T) # but this uses xy not ij, hence the .T
xnew = np.linspace(t[0], t[-1], 10*len(t))
ynew = np.linspace(freqs[0], freqs[-1], 10*len(freqs)) # was it wider spaced than freqs on purpose?
znew = f(xnew, ynew).T
Then plotting as you have:
Previous answer:
If you just want to plot on logscale, use matplotlib.colors.LogNorm
from matplotlib import colors
znew = f(xnew, ynew)  # Don't take the log here
plt.figure(figsize=(6, 3.2))
plt.pcolormesh(xnew, ynew, znew, cmap=plt.cm.gist_heat, norm=colors.LogNorm())
And that looks like this:
Of course that still has gaps where its value is negative when plotted on a log scale. What your data means to you when the value is negative should dictate how you fill this in. One simple solution is to just set those values to the smallest positive value and they'd fill in as black:
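For example, a one-line sketch of that fill-in:
znew[znew <= 0] = znew[znew > 0].min()  # clamp non-positive values so LogNorm can render them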
