Can't fit Poisson to histogram in Python

I've looked at a bunch of examples on here and tried using snippets of other codes, but they're not working for me. I have 4 data sets, but I'll include just one here. My professor told me that the data appeared to be Poisson distributed, so I am trying to fit a Poisson to a histogram of the data. Here is my code:
######## Poisson fit ########
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.special import factorial
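data = data59[4]  # data59 is the asker's data set (not shown here)
# normed= was removed from Matplotlib; density= is the current keyword
entries,bin_edges,patches = plt.hist(data,60,[1,10],density=True)
bin_middles = 0.5*(bin_edges[1:]+bin_edges[:-1])
def poisson(k, lamb):
    return np.exp(-lamb)*(lamb**k)/factorial(k)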
popt,pcov = curve_fit(poisson,bin_middles,entries)
x = np.linspace(1,10,100)
plt.plot(x,poisson(x,*popt))
plt.show()
I tried plotting other distributions on top of the histogram like normal and Rayleigh using scipy.stats instead of curve_fit. Those kind of worked only because they have a scale parameter, which scipy.stats.poisson doesn't. The distribution for this comes out looking exactly the same as the curve_fit. I'm not sure how to resolve this issue. Perhaps the data is not even Poisson distributed!
Thanks for helping!!
Update: The data is IceCube data from the TXS 0506+056 blazar. I used SkyDrive to get a URL for the file. I hope it works. The first column is the modified Julian day and the last column is the log of the energy proxy. I am using the last column. I have null and alternative hypotheses surrounding this data and am using maximum likelihood estimation (from a certain distribution, Poisson in my first case) to analyze the data.
Also, here is where I got the data: https://icecube.wisc.edu/science/data/TXS0506_point_source

The data presented in your histogram does not have a Poisson distribution. The Poisson is a counting distribution (what's the probability of getting 0, 1, 2, ... observations per unit of time or space), so its support is the non-negative integers. Your histogram clearly shows that you have fractional values, since the spikes have different non-zero heights at non-integer locations.
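To make the point concrete, here is a minimal sketch of what a Poisson fit looks like when the data really are counts. The data are synthetic (an assumption, not the IceCube file), and the names counts, lam_hat and is_count_data are only illustrative. For count data the maximum likelihood estimate of the rate is simply the sample mean, so no curve_fit is needed.

```python
import numpy as np
from scipy.stats import poisson

# Synthetic integer counts (assumption: stands in for real count data)
rng = np.random.default_rng(0)
counts = rng.poisson(lam=3.5, size=5000)

# A Poisson model only makes sense for non-negative integers
is_count_data = bool(np.all(counts >= 0) and np.all(counts == np.round(counts)))

# Maximum likelihood estimate of the Poisson rate is the sample mean
lam_hat = counts.mean()

# Compare the empirical relative frequencies with the fitted PMF
k = np.arange(counts.max() + 1)
empirical_pmf = np.bincount(counts) / counts.size
fitted_pmf = poisson.pmf(k, lam_hat)
```

Running the same is_count_data check on a continuous quantity such as a log energy proxy would return False, which is a quick way to see that a Poisson PMF is the wrong model for it.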

Related

Getting random numbers from a truncated Maxwell-Boltzmann distribution

I would like to generate random numbers using a truncated Maxwell-Boltzmann distribution. I know that scipy has built-in Maxwell random variables, but there is no truncated version of it (I am also aware of a truncated normal distribution, which is irrelevant here). I have tried to write my own random variables using rv_continuous:
import numpy as np
import scipy.stats as st
from scipy.special import erf
# v_0 and v_esc are parameters defined elsewhere in the script
class maxwell_boltzmann_pdf(st.rv_continuous):
    def _pdf(self,x):
        n_0 = np.power(np.pi,3/2)*np.square(v_0)*(v_0*erf(v_esc/v_0)-(2/np.sqrt(np.pi))*v_esc*np.exp(-np.square(v_esc/v_0)))
        return (1/n_0)*(4*np.pi*np.square(x))*np.exp(-np.square(x/v_0))*np.heaviside(v_esc-x,0)
maxwell_boltzmann_cv = maxwell_boltzmann_pdf(a=0, b=v_esc, name='maxwell_boltzmann_pdf')
This does exactly what I want, but it is way too slow for my purpose (I am doing Monte Carlo simulations), even if I draw all the random velocities outside of all the loops. I have also thought of using Inverse transform sampling method, but the inverse of the CDF does not have an analytic form and I will need to do a bisection for every number I draw. It would be great if there is a convenient way for me to generate random numbers from a truncated Maxwell-Boltzmann distribution with decent speed.
There are several things you can do here.
For fixed parameters v_esc and v_0, n_0 is a constant, so it doesn't need to be calculated in the pdf method.
If you define only a PDF for a SciPy rv_continuous subclass, then the class's rvs, mean, and so on will be very slow, presumably because the method needs to integrate the PDF every time it needs to generate a random variate or calculate a statistic. If speed is at a premium, you will thus need to add to maxwell_boltzmann_pdf an _rvs method that uses its own sampler. (See also this question.) One possible method is rejection sampling: draw a point uniformly from a box that encloses the PDF, and repeat until the point falls under the PDF curve; the point's x-coordinate is then the sample. It works for any bounded PDF with a finite domain, as long as you know what the domain and bound are (the bound is the maximum value of f in the domain). See this question for example Python code.
If you know the distribution's CDF, then there are some additional tricks. One of them is the relatively new k-vector sampling method for sampling a continuous distribution. There are two phases: a setup phase and a sampling phase. The setup phase involves approximating the CDF's inverse via root finding, and the sampling phase uses this approximation to generate random numbers that follow the distribution in a very fast way without having to further evaluate the CDF. For a fixed distribution like this one, if you show me the CDF, I can precalculate the necessary data and the code needed to sample the distribution using that data. Essentially, the only non-trivial part of k-vector sampling is the root-finding step.
More information on sampling from an arbitrary distribution is found on my sampling methods page.
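The rejection sampling idea described above can be sketched as follows. This is only an illustration under stated assumptions: the function name sample_truncated_mb is hypothetical, the default parameter values are the ones quoted elsewhere in this thread (v_0 = 220, v_esc = 550), and the unnormalized density v^2 exp(-(v/v_0)^2) is used because rejection sampling does not need the normalization constant.

```python
import numpy as np

def sample_truncated_mb(n, v_0=220.0, v_esc=550.0, rng=None):
    """Draw n speeds from a Maxwell-Boltzmann density truncated at v_esc."""
    rng = np.random.default_rng() if rng is None else rng

    def g(v):
        # Unnormalized PDF; the normalization cancels in the accept test
        return v**2 * np.exp(-(v / v_0)**2)

    # g rises up to v = v_0 and falls afterwards, so its maximum on
    # [0, v_esc] sits at min(v_0, v_esc)
    g_max = g(min(v_0, v_esc))

    accepted = np.empty(0)
    while accepted.size < n:
        # Uniform points in the box [0, v_esc] x [0, g_max]
        v = rng.uniform(0.0, v_esc, size=2 * n)
        u = rng.uniform(0.0, g_max, size=2 * n)
        # Keep the points that fall under the curve
        accepted = np.concatenate([accepted, v[u < g(v)]])
    return accepted[:n]

samples = sample_truncated_mb(10_000, rng=np.random.default_rng(1))
```

Because whole batches are drawn and tested with array operations, the Python-level loop typically runs only a couple of times, which is what makes this usable inside a Monte Carlo simulation.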
It turns out that there is a way to generate a truncated Maxwell-Boltzmann distribution with the inverse transform sampling method using the ppf feature of scipy. I am posting the code here for future reference.
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import erf
from scipy.stats import maxwell
# parameters
v_0 = 220 #km/s
v_esc = 550 #km/s
N = 10000
# CDF(v_esc)
cdf_v_esc = maxwell.cdf(v_esc,scale=v_0/np.sqrt(2))
# pdf for the distribution
def f_MB(v_mag):
    n_0 = np.power(np.pi,3/2)*np.square(v_0)*(v_0*erf(v_esc/v_0)-(2/np.sqrt(np.pi))*v_esc*np.exp(-np.square(v_esc/v_0)))
    return (1/n_0)*(4*np.pi*np.square(v_mag))*np.exp(-np.square(v_mag/v_0))*np.heaviside(v_esc-v_mag,0)
# plot the pdf
x = np.arange(600)
y = [f_MB(i) for i in x]
plt.plot(x,y,label='pdf')
# use inverse transform sampling to get the truncated Maxwell-Boltzmann distribution
sample = maxwell.ppf(np.random.rand(N)*cdf_v_esc,scale=v_0/np.sqrt(2))
# plot the histogram of the samples
plt.hist(sample,bins=100,histtype='step',density=True,label='histogram')
plt.xlabel('v_mag')
plt.legend()
plt.show()
This code generates the required random variables and compares their histogram with the analytic form of the pdf; the two match pretty well.
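For readers wondering why scaling the uniform draws by CDF(v_esc) gives the truncated distribution, here is a short derivation. The symbols F (the CDF of the untruncated Maxwell distribution), U and V are introduced here for illustration and do not appear in the code. The choice scale = v_0/sqrt(2) matches the pdf above because SciPy's maxwell density is proportional to x^2 exp(-x^2/(2 a^2)) with scale a.

```latex
\begin{aligned}
V &= F^{-1}\bigl(U\,F(v_{\mathrm{esc}})\bigr), \qquad U \sim \mathrm{Uniform}(0,1),\\
P(V \le v) &= P\bigl(U\,F(v_{\mathrm{esc}}) \le F(v)\bigr)
            = \frac{F(v)}{F(v_{\mathrm{esc}})}, \qquad 0 \le v \le v_{\mathrm{esc}}.
\end{aligned}
```

The right-hand side is exactly the CDF of the original distribution conditioned on v <= v_esc, so no bisection per sample is needed beyond what maxwell.ppf already does internally.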

Comparing models fitting multivariate data

I have trouble using WAIC (widely applicable information criterion) in PyMC3. Namely, I have data which I know to be distributed according to multivariate Dirichlet distribution. I try to fit the data by assuming that marginal distributions are in one case the beta distributions, while in the other lognormal distributions. Obviously in the first case I get lower (better) WAIC value, than in the second case.
The problem arises in the third case, when I assume that the data is distributed according to a Dirichlet distribution. The third WAIC is significantly larger (worse) than in the first two cases. I would expect this WAIC to be lower (better) than the one I get in the second (log-normal) case.
Basically I want to show that the log-normal fit is bad. This is easily seen by the naked eye, but I would like to have a formal result to show.
The minimal code to replicate what I get:
import pandas as pd
import numpy as np
import pymc3 as pm
# generate the data
df=pd.DataFrame(np.random.dirichlet([10,10,10],size=2000))
# fit the first case (assuming beta marginal distributions)
betaModel=pm.Model()
with betaModel:
    alpha=pm.Uniform("alpha",lower=0,upper=20,shape=3)
    beta=pm.Uniform("beta",lower=0,upper=40,shape=3)
    observed=pm.Beta("obs",alpha=alpha,beta=beta,observed=df.values,shape=df.shape)
    betaTrace=pm.sample()
# fit the second case (assuming log-normal marginal distributions)
lognormalModel=pm.Model()
with lognormalModel:
    mu=pm.Normal("mu",mu=0,sd=3,shape=3)
    sd=pm.HalfNormal("sd",sd=3,shape=3)
    observed=pm.Lognormal("obs",mu=mu,sd=sd,observed=df.values,shape=df.shape)
    lognormalTrace=pm.sample()
# fit the third case (assuming Dirichlet multivariate distribution)
dirichletModel=pm.Model()
with dirichletModel:
    alpha=pm.HalfNormal("alpha",sd=3,shape=3)
    observed=pm.Dirichlet("obs",a=alpha,observed=df.values,shape=df.shape)
    dirichletTrace=pm.sample()
# compare WAIC
print(pm.waic(betaTrace,betaModel))
print(pm.waic(lognormalTrace,lognormalModel))
print(pm.waic(dirichletTrace,dirichletModel))
The output is:
WAIC_r(WAIC=-12801.95319823564, WAIC_se=105.07502476563575, p_WAIC=5.941977774190434)
WAIC_r(WAIC=-12534.643059697866, WAIC_se=115.43257184238044, p_WAIC=6.68850211472046)
WAIC_r(WAIC=-9156.050975326929, WAIC_se=81.45146980652675, p_WAIC=2.7977039523187996)
I guess the problem might be related to an error:
ValueError: operands could not be broadcast together with shapes (6000,) (2000,)
which I get when I try to run:
pm.compare((betaTrace,lognormalTrace,dirichletTrace),(betaModel,lognormalModel,dirichletModel))
Any suggestions how to obtain a reasonable comparison?
Edit
After thinking about the problem, I believe the comparison itself is somewhat "improper". WAIC is a relative measure, so it is likely that only similar statistical models can be reasonably compared. If the models are too dissimilar, then you get what I got.
The error I get from pm.compare seems to be related to how random vectors are treated. In the first two cases each component of a random vector is treated as a separate random variate (3 components per 2000 vectors = 6000 points). In the third case the random vector as a whole is treated as a single random variate (2000 vectors = 2000 points).
Initially I thought that this problem could be resolved by reducing the number of points in the first two cases. But as the first two statistical models (wrongly) assume that components are independent, adding log-probabilities does not change anything. WAIC values remain the same.
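The 6000-versus-2000 mismatch described above can be reproduced without PyMC3. The sketch below uses scipy.stats instead of PyMC3's internals (an assumption made for illustration), with the true concentration parameters rather than posterior draws. It relies on the fact that each marginal of a Dirichlet(a) vector is Beta(a_i, sum(a) - a_i).

```python
import numpy as np
from scipy.stats import beta, dirichlet

rng = np.random.default_rng(0)
a = np.array([10.0, 10.0, 10.0])
x = rng.dirichlet(a, size=2000)          # shape (2000, 3)

# Independent beta marginals: one log-density per component
beta_logp = beta.logpdf(x, a, a.sum() - a)
print(beta_logp.shape, beta_logp.size)   # (2000, 3) -> 6000 pointwise terms

# Dirichlet: one log-density per whole vector
dirichlet_logp = np.array([dirichlet.logpdf(row, a) for row in x])
print(dirichlet_logp.shape)              # (2000,) pointwise terms

# Summing the beta terms over components gives 2000 values, but the total
# log-likelihood (and hence the independence assumption) is unchanged
beta_logp_per_row = beta_logp.sum(axis=1)
```

This only shows where the two array lengths come from; it does not by itself make the two WAIC values comparable.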
Currently I think that a small cheat is possible. Namely to fit data to the Dirichlet distribution, but calculate WAIC as if I would have fitted the beta distribution. This gives an expected result - WAIC for the Dirichlet fit is slightly larger than WAIC for the beta fit, but smaller than WAIC for the log-normal fit.
The code for this "cheat":
from collections import namedtuple
from scipy.special import logsumexp
def cheat_logp(tracePoint,model):
    values=model.obs.eval()
    _,components=values.shape
    cb=[None]*components
    # the marginal of component i of a Dirichlet is Beta(alpha_i, sum(alpha)-alpha_i)
    beta=np.sum(tracePoint["alpha"])
    for i in range(components):
        cheatBeta=pm.Beta.dist(alpha=tracePoint["alpha"][i],beta=beta-tracePoint["alpha"][i])
        cb[i]=cheatBeta.logp(values[:,i]).eval()
    return np.array(cb).T
def _log_post_trace(trace, model):
    # copy the contents of _log_post_trace function from pymc3/stats.py
    # but replace "var.logp_elemwise(pt)" with "cheat_logp(pt,model)"
    # <...>
    raise NotImplementedError("body elided in the original post")
def mywaic(trace, model=None, pointwise=False):
    # copy the contents of waic function from pymc3/stats.py
    # <...>
    raise NotImplementedError("body elided in the original post")
Obviously this cheat is not very "nice", and I am still very much interested in how to achieve similar results in a proper manner, if that is possible at all.

Most accurate way to interpolate data and find the peak?

The data I have is always on a second degree polynomial (quadratic function). I want to find the peak of the interpolated function as accurately as possible.
So far I've been using interp1d and then extract the peak value using linspace and a simple for loop. Although you can use a large number of newly generated samples in linspace you can still be more precise using the derivative of the fitted polynomial. I haven't found a way to do that using interp1d.
Now the only function I've found that returns the fitted polynomial coefficients is polyfit, but this fitted function is quite inaccurate (most of the time the function doesn't even go through the data points).
I've tried using UnivariateSpline and the fitted function seems to be quite accurate and it's very simple to get the derivative spline and its root.
Other polynomial fitting functions (BarycentricInterpolator, KroghInterpolator, ...) state that they are not computing polynomial coefficients for reasons of numerical stability.
How accurate is UnivariateSpline and its derivatives, or are there any better options out there?
If all you need is to find the min/max of a second degree polynomial why not do this:
import matplotlib.pyplot as plt
from scipy.interpolate import KroghInterpolator
import numpy as np
x=range(-20,20)
y=[]
for i in x:
    y.append((i**2)+25)
x=x[1::5]
y=y[1::5]
f=KroghInterpolator(x,y)
xfine=np.arange(min(x),max(x),.5)
yfine=f(xfine)
val_interp=min(yfine)
print(val_interp)
plt.scatter(x,y)
plt.plot(xfine, yfine)
plt.show()
In the end I went with polyfit. Although the fitted function didn't go exactly through the data points the end result was still good. From the returned coefficients I got the desired x and y coordinates of the peak.
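For reference, the polyfit route can be sketched like this. The sample points are made up for illustration (an assumption; the asker's data is not shown). For a fitted quadratic a*x^2 + b*x + c, the derivative 2*a*x + b vanishes at x = -b/(2a), which gives the peak directly without sampling a fine grid.

```python
import numpy as np

# Made-up samples lying on a downward-opening parabola with its peak at (2.5, 40)
x = np.array([-19.0, -14.0, -9.0, -4.0, 1.0, 6.0, 11.0, 16.0])
y = -(x - 2.5)**2 + 40.0

# Least-squares fit of a second-degree polynomial; coefficients are
# returned from the highest power down
a, b, c = np.polyfit(x, y, 2)

# Root of the derivative 2*a*x + b
x_peak = -b / (2.0 * a)
y_peak = np.polyval([a, b, c], x_peak)
```

With noisy data the fitted curve will not pass through every point, but that is expected of a least-squares fit and does not by itself make the peak estimate worse.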

Probability Distribution Function Python

I have a set of raw data and I have to identify the distribution of that data. What is the easiest way to plot a probability distribution function? I have tried fitting it to a normal distribution.
But I am more curious to know which distribution the data actually follows.
I have no code to show my progress, as I have failed to find any functions in Python that will allow me to test the distribution of the dataset. I do not want to slice the data and force it to fit a normal or skewed distribution.
Is there any way to determine the distribution of the dataset? Any suggestions appreciated.
Is this a correct approach? Example
This is close to what I am looking for, but again it fits the data to a normal distribution. Example
EDIT:
The input has million rows and the short sample is given below
Hashtag,Frequency
#Car,45
#photo,4
#movie,6
#life,1
The frequency ranges from 1 to 20,000 count and I am trying to identify the distribution of the frequency of the keywords. I tried plotting a simple histogram but I get the output as a single bar.
Code:
import pandas
import matplotlib.pyplot as plt
df = pandas.read_csv('Paris_random_hash.csv', sep=',')
plt.hist(df['Frequency'])
plt.show()
Output
This is a minimal working example for showing a histogram. It only solves part of your question, but it can be a step towards your goal. Note that np.histogram returns the bin edges (one more value than there are bins), so you have to average adjacent edges to get each bin's center.
import numpy as np
import matplotlib.pyplot as pl
x = np.random.randn(10000)
nbins = 20
n, bins = np.histogram(x, nbins, density=True)
pdfx = np.zeros(n.size)
pdfy = np.zeros(n.size)
for k in range(n.size):
    pdfx[k] = 0.5*(bins[k]+bins[k+1])  # bin center
    pdfy[k] = n[k]
pl.plot(pdfx, pdfy)
pl.show()
You can fit your data using the example shown in:
Fitting empirical distribution to theoretical ones with Scipy (Python)?
Definitely a stats question - sounds like you're trying to do a probability test of whether the distribution is significantly similar to the normal, lognormal, binomial, etc. distributions. The easiest is to test for normal or lognormal as explained below.
Set your P-value cutoff. Usually, if your P-value is <= 0.05, you reject the hypothesis that the data is normally distributed.
In Python, use SciPy. You only need the returned P-value for the test; the function returns two values (I'm ignoring the optional inputs here for clarity):
import scipy.stats
W, Pvalue = scipy.stats.shapiro(x)
Perform the Shapiro-Wilk test for normality. The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.
If you want to see if it is lognormally distributed (provided it doesn't pass the P test above), you can try:
import numpy
W, Pvalue = scipy.stats.shapiro(numpy.log(x))
Interpret the same way - I just tested on a known lognormally distributed simulation and got a 0.17 Pvalue on the np.log(x) test, and a number close to 0 for the standard shapiro(x) test. That tells me lognormally distributed is the better choice, normally distributed fails miserably.
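The experiment described above can be sketched as follows. The data are simulated (an assumption), so the exact P-values will differ from the 0.17 quoted; the variable names are only illustrative.

```python
import numpy as np
from scipy import stats

# Simulated lognormal sample (assumption: stands in for real data)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Shapiro-Wilk on the raw data: normality should be clearly rejected
w_raw, p_raw = stats.shapiro(x)

# Shapiro-Wilk on log(x): if x is lognormal, log(x) is normal
w_log, p_log = stats.shapiro(np.log(x))

alpha = 0.05
print("raw data rejected as normal:", p_raw <= alpha)
print("log data rejected as normal:", p_log <= alpha)
```

Keep in mind that a P-value above the cutoff only means the test failed to reject normality; it does not prove the data is normal, and with a million rows even tiny deviations will be rejected.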
I made it simple which is what I gathered you are looking for. For other distributions, you may need to do more work along the lines of Q-Q plots https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot and not simply following a few tests I proposed. That means you have a plot of the distribution you are trying to fit to vs. your data plotted. Here's a quick example that can get you down that path if you so desire:
import numpy as np
import pylab
import scipy.stats as stats
mydata = np.random.lognormal(size=1000)  # placeholder: replace with the data you want to fit
stats.probplot(mydata, dist='norm', plot=pylab)
pylab.show()
Above, you can substitute for dist='norm' any distribution from the SciPy library (http://docs.scipy.org/doc/scipy/reference/tutorial/stats/continuous.html#continuous-distributions-in-scipy-stats): find its SciPy name and add shape parameters according to the documentation, such as stats.probplot(mydata, dist='loggamma', sparams=(1,1), plot=pylab) or, for Student's t, stats.probplot(mydata, dist='t', sparams=(1,), plot=pylab). Then look at the plot and see how closely your data follows that distribution. If the data points are close to the line, you've found a plausible distribution. The graph also shows an R^2 value; generally, the closer to 1, the better the fit.
And if you want to continue trying to do what you're doing with the dataframe, try changing to: plt.hist(df['Frequency'].values)
Did you try using the seaborn library? They have a nice kernel density estimation function. Try:
import seaborn as sns
sns.kdeplot(df['Frequency'])
You find installation instructions here
The only distribution the data carry within itself is the empirical probability. If you have the data as a 1-D NumPy array data, you can compute the value of the empirical distribution function at x as the cumulative relative frequency of the values less than or equal to x:
data[data <= x].size / data.size
This is a step function so it does not have an associated probability density function but a probability mass function where the mass of each observed value is its relative frequency. To compute the relative frequencies:
values, freqs = np.unique(data, return_counts=True)
rfreqs = freqs / data.size
This does not mean that the data is a random sample from their empirical distribution. If you want to know what distribution your data are a sample from (if any) just by looking at the data, the answer is you can't. But that is more about statistics than about programming.
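Put together, the two snippets above might look like this on a tiny made-up array (the values and the helper name ecdf are assumptions for illustration only):

```python
import numpy as np

# Made-up sample, e.g. a handful of hashtag frequencies
data = np.array([1, 1, 4, 6, 45, 4, 1])

def ecdf(data, x):
    """Empirical distribution function: share of observations <= x."""
    return np.count_nonzero(data <= x) / data.size

# Empirical probability mass function: relative frequency of each value
values, freqs = np.unique(data, return_counts=True)
rfreqs = freqs / data.size

print(ecdf(data, 4))      # 5 of the 7 observations are <= 4
print(values, rfreqs)
```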
The histogram does not do what you think it does; what you are trying to show is a bar graph. The histogram needs each data point separately in a list, not the frequencies themselves. You have [3,2,0,4,...] but should have [1,1,1,2,2,4,4,4,4]. You cannot determine a probability distribution automatically.
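The difference between the two representations mentioned above can be shown with NumPy alone. This is a small sketch using the numbers from the answer; the variable names are illustrative. np.repeat expands per-value counts into individual observations, and histogramming those observations recovers the counts.

```python
import numpy as np

values = np.array([1, 2, 3, 4])
counts = np.array([3, 2, 0, 4])      # how often each value was observed

# Expand the counts into one entry per observation
expanded = np.repeat(values, counts)
print(expanded.tolist())             # [1, 1, 1, 2, 2, 4, 4, 4, 4]

# A histogram of the expanded data gives back the original counts
hist_counts, edges = np.histogram(expanded, bins=[0.5, 1.5, 2.5, 3.5, 4.5])
```

If the counts are already available, plotting them directly as a bar graph avoids the expansion step entirely.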
I think you are asking a slightly different question:
What is the correlation between my raw data and the curve to which I have mapped it?
This is a conceptual problem, and you are trying to understand the meanings of the values R and R squared. Start by working through this Minitab blog post. You may want to skim this non-Python KaleidaGraph guide to understand the classes of curves to fit and the usage of Least-Mean-Squares in fitting the curves.
You were probably downvoted because it is a math question more than a programming question.
I may be missing something, but it seems that a major point is being overlooked across the board: The data set you are describing is a categorical data set. That is, the x-values are not numeric, they're just words (#Car, #photo, etc.). The concept of the shape of a probability distribution is meaningless for a categorical data set, since there is no logical ordering for the categories. What would a histogram even look like? Would #Car be the first bin? Or would it be all the way to the right of your graph? Unless you have some criterion for quantifying your categories, trying to make judgments based on the shape of the distribution is meaningless.
Here's a small text-based example to clarify what I'm saying. Suppose I survey a group of people and ask their favorite color. I plot the results:
Red | ##
Green | #####
Blue | #######
Yellow | #####
Orange | ##
Huh, looks like color preferences are normally distributed. Wait, what if I had randomly put the colors in a different order in my graph:
Blue | #######
Yellow | #####
Green | #####
Orange | ##
Red | ##
I guess the data is actually positively skewed? Not so, of course - for a categorical data set the shape of the distribution is meaningless. Only if you were to decide to somehow quantify each hashtag in your data set would the problem have meaning. Do you want to compare the length of a hashtag to its frequency? Or the alphabetical ordering of a hashtag to its frequency? Etc.

How to create dataset for fitting a function in scipy stats?

I want to fit some data to a Pareto distribution using the scipy.stats library. I am not sure if the issue might be numerical, so just to be safe, here is the setup: I have values measured for the dependent variable (let's call them 'pushes') for the independent variable ('minutes') starting at a few thousand minutes and every ten minutes thereafter (with the exception of a few points that were removed during data cleaning).
e.g.
2780.0 362.0
2800.0 376.0
2810.0 393.0
...
The best info I can find says something like
from scipy.stats import pareto
result = pareto.fit(data)
and I have no idea how this data is to be formatted in this case. I've tried the following but all result in errors.
result = pareto.fit(zip(minutes, pushes))
result = pareto.fit(pushes)
The error is usually
Warning: invalid value encountered in double_scalars
I would greatly appreciate some guidance, thank you.
As I mentioned in the comments above, pareto.fit() is not what you're looking for.
The .fit() methods of the continuous distributions in scipy.stats obtain an estimate of the distribution parameters that maximises the likelihood of some particular set of sample values. Therefore, pareto.fit() wants only a single array argument containing the samples you want to fit the distribution to. The other keyword arguments control various aspects of the fitting process, for example by specifying initial values for the distribution parameters.
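As an illustration of the kind of input pareto.fit() expects, here is a sketch with simulated samples (an assumption; this is not the asker's minutes/pushes data). The input is a single 1-D array of draws from the distribution, and floc=0 fixes the location parameter so that only the shape and scale are estimated.

```python
from scipy.stats import pareto

# Simulated draws from a Pareto distribution with shape b = 3
samples = pareto.rvs(b=3.0, size=5000, random_state=0)

# Maximum likelihood fit; returns (shape, loc, scale)
b_hat, loc_hat, scale_hat = pareto.fit(samples, floc=0)
print(b_hat, loc_hat, scale_hat)
```

There is no x/y pairing anywhere in this call, which is why passing zip(minutes, pushes) cannot work.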
What you're actually trying to do is to fit the relationship between some independent variable x and some dependent variable y, i.e.
y_fit = f(x, params)
What you need to do is:
1. Choose some functional form for f. From your description, the plot of y vs x resembles the probability density function for a Pareto distribution, so perhaps either this or a decaying exponential might be appropriate.
2. Find the set of params that minimize some measure of the difference between y and y_fit (usually the sum of squared differences). You could use scipy.optimize.curve_fit or scipy.optimize.minimize to do this.
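The two steps above can be sketched with scipy.optimize.curve_fit. Everything specific here is an assumption made for illustration: the data are synthetic, the functional form is a simple power law (the shape of a Pareto density tail), and the names power_law, X0, amplitude and exponent are hypothetical. Dividing x by X0 keeps the parameters at a comparable scale, which tends to help the optimizer.

```python
import numpy as np
from scipy.optimize import curve_fit

X0 = 2780.0  # reference point: the first value of the independent variable

def power_law(x, amplitude, exponent):
    """Step 1: the chosen functional form f(x, params)."""
    return amplitude * (x / X0) ** (-exponent)

# Synthetic stand-in for the (minutes, pushes) measurements
rng = np.random.default_rng(0)
minutes = np.arange(2780.0, 6000.0, 10.0)
pushes = power_law(minutes, 400.0, 1.5) * (1 + 0.01 * rng.standard_normal(minutes.size))

# Step 2: least-squares estimate of the parameters
params, covariance = curve_fit(power_law, minutes, pushes, p0=(pushes[0], 1.0))
amplitude_hat, exponent_hat = params
pushes_fit = power_law(minutes, *params)
```

If a decaying exponential describes the data better, only the body of power_law and the initial guess p0 need to change.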
