I would like to generate random numbers from a truncated Maxwell-Boltzmann distribution. I know that scipy has a built-in Maxwell random variable, but there is no truncated version of it (I am also aware of the truncated normal distribution, which is irrelevant here). I have tried to write my own random variable by subclassing rv_continuous:
import numpy as np
import scipy.stats as st
from scipy.special import erf

class maxwell_boltzmann_pdf(st.rv_continuous):
    def _pdf(self, x):
        # v_0 and v_esc are parameters defined elsewhere in my script
        n_0 = np.power(np.pi, 3/2)*np.square(v_0)*(v_0*erf(v_esc/v_0) - (2/np.sqrt(np.pi))*v_esc*np.exp(-np.square(v_esc/v_0)))
        return (1/n_0)*(4*np.pi*np.square(x))*np.exp(-np.square(x/v_0))*np.heaviside(v_esc - x, 0)

maxwell_boltzmann_cv = maxwell_boltzmann_pdf(a=0, b=v_esc, name='maxwell_boltzmann_pdf')
This does exactly what I want, but it is way too slow for my purpose (I am doing Monte Carlo simulations), even if I draw all the random velocities outside of all the loops. I have also thought of using the inverse transform sampling method, but the inverse of the CDF does not have an analytic form, so I would need to do a bisection for every number I draw. It would be great if there were a convenient way to generate random numbers from a truncated Maxwell-Boltzmann distribution with decent speed.
There are several things you can do here.
For fixed parameters v_esc and v_0, n_0 is a constant, so it doesn't need to be calculated in the pdf method.
If you define only a PDF for a SciPy rv_continuous subclass, then the class's rvs, mean, and so on will be very slow, presumably because the generic machinery has to integrate the PDF every time it generates a random variate or calculates a statistic. If speed is at a premium, you will thus need to add to maxwell_boltzmann_pdf an _rvs method that uses its own sampler. (See also this question.) One possible method is rejection sampling: draw a point uniformly in a bounding box until the point falls under the PDF curve. It works for any bounded PDF with a finite domain, as long as you know what the domain and the bound are (the bound is the maximum value of the PDF on that domain). See this question for example Python code.
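A minimal sketch of such a rejection sampler for this particular shape (the function name and the batched drawing strategy are my own; only the unnormalized density x**2 * exp(-(x/v_0)**2) on [0, v_esc] comes from the question):

import numpy as np

def sample_truncated_mb(n_samples, v_0, v_esc, rng=None):
    # Rejection sampling for the truncated Maxwell-Boltzmann speed distribution.
    # Only the shape of the density matters; the normalization constant cancels.
    rng = np.random.default_rng() if rng is None else rng
    f = lambda x: x**2 * np.exp(-(x / v_0)**2)
    bound = f(min(v_0, v_esc))   # the unnormalized density peaks at x = v_0
    out = np.empty(0)
    while out.size < n_samples:
        x = rng.uniform(0.0, v_esc, size=2 * n_samples)   # horizontal draws
        u = rng.uniform(0.0, bound, size=2 * n_samples)   # vertical draws
        out = np.concatenate([out, x[u < f(x)]])          # keep points under the curve
    return out[:n_samples]

samples = sample_truncated_mb(10_000, v_0=220.0, v_esc=550.0)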
If you know the distribution's CDF, then there are some additional tricks. One of them is the relatively new k-vector sampling method for sampling a continuous distribution. There are two phases: a setup phase and a sampling phase. The setup phase involves approximating the CDF's inverse via root finding, and the sampling phase uses this approximation to generate random numbers that follow the distribution in a very fast way without having to further evaluate the CDF. For a fixed distribution like this one, if you show me the CDF, I can precalculate the necessary data and the code needed to sample the distribution using that data. Essentially, the only non-trivial part of k-vector sampling is the root-finding step.
More information on sampling from an arbitrary distribution is found on my sampling methods page.
It turns out that there is a way to generate a truncated Maxwell-Boltzmann distribution with the inverse transform sampling method, using the ppf method of scipy.stats.maxwell. I am posting the code here for future reference.
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import erf
from scipy.stats import maxwell
# parameters
v_0 = 220 #km/s
v_esc = 550 #km/s
N = 10000
# CDF(v_esc)
cdf_v_esc = maxwell.cdf(v_esc,scale=v_0/np.sqrt(2))
# pdf for the distribution
def f_MB(v_mag):
    n_0 = np.power(np.pi, 3/2)*np.square(v_0)*(v_0*erf(v_esc/v_0) - (2/np.sqrt(np.pi))*v_esc*np.exp(-np.square(v_esc/v_0)))
    return (1/n_0)*(4*np.pi*np.square(v_mag))*np.exp(-np.square(v_mag/v_0))*np.heaviside(v_esc - v_mag, 0)
# plot the pdf
x = np.arange(600)
y = [f_MB(i) for i in x]
plt.plot(x,y,label='pdf')
# use inverse transform sampling to get the truncated Maxwell-Boltzmann distribution
sample = maxwell.ppf(np.random.rand(N)*cdf_v_esc,scale=v_0/np.sqrt(2))
# plot the histogram of the samples
plt.hist(sample,bins=100,histtype='step',density=True,label='histogram')
plt.xlabel('v_mag')
plt.legend()
plt.show()
This code generates the required random variables and compares their histogram with the analytic form of the PDF; the two match each other pretty well.
Related
There are a few methods I have come across that can do kernel density estimation which will provide a PDF for a sample of data:
KDEpy
sklearn.neighbors.KernelDensity
scipy.stats.gaussian_kde
Using any of the above I can generate a PDF; however, I want to know how to get the CDF for the PDF I am generating. In math I know you can integrate the PDF to get the CDF; the issue is that these methods only supply x and y points, not a function to integrate.
I'm wondering how I could transform the given data into a CDF plot, or alternatively find the PDF function for the data and then integrate it to get the CDF. Or use an alternative method whose output is a CDF instead of a PDF.
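For context, the kind of numerical integration I have in mind would look something like this (a sketch only; the gaussian_kde sample, the grid size and the cumulative_trapezoid call are stand-ins, not something any of the libraries above requires):

import numpy as np
from scipy.integrate import cumulative_trapezoid
from scipy.stats import gaussian_kde

data = np.random.normal(size=10_000)            # stand-in sample
kde = gaussian_kde(data)

x = np.linspace(data.min(), data.max(), 500)    # the x points
pdf = kde.pdf(x)                                # the y points the KDE tools give you

cdf = cumulative_trapezoid(pdf, x, initial=0)   # numerically integrate the PDF
cdf /= cdf[-1]                                  # renormalize for the truncated tails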
MCVE
Let's create some dummy data to ground the discussion:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
np.random.seed(123)
data = stats.norm(loc=0, scale=1).rvs(10**4)
Here is the baseline idea with the scipy.stats package.
Gaussian KDE
We can estimate KDE using dedicated tools such as gaussian_kde:
kde = stats.gaussian_kde(data)
This exposes a pdf method we can evaluate at any x, but it is missing the CDF.
Checking samples with the Kolmogorov-Smirnov test, we cannot reject the null hypothesis (the two distributions are identical) at the 10% threshold:
stats.ks_2samp(data, kde.resample(100).squeeze())
# KstestResult(statistic=0.0969, pvalue=0.29163373800871994)
Continuous Variable
The scipy.stats package also exposes a generic class rv_continuous to inherit from. As stated in the documentation:
New random variables can be defined by subclassing the rv_continuous
class and re-defining at least the _pdf or the _cdf method (normalized
to location 0 and scale 1).
So we can use this purpose-built logic to fill the gap. Without any performance consideration, it boils down to:
class KDEDist(stats.rv_continuous):
    def __init__(self, kde, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._kde = kde

    def _pdf(self, x):
        return self._kde.pdf(x)
Then we create the underlying object with our experimental KDE.
X = KDEDist(kde)
stats.ks_2samp(data, X.rvs(size=100)) # This call is kind of intensive
# KstestResult(statistic=0.0625, pvalue=0.8113077271721811)
Now we can naturally - at least in terms of API calls - evaluate the PDF and CDF as well:
x = np.linspace(data.min(), data.max(), 200)  # evaluation grid
fig, axe = plt.subplots()
axe.hist(data, density=1)
axe.plot(x, X.pdf(x))
axe.plot(x, X.cdf(x))
It returns:
Performance considerations
Notice that this methodology answers your question but is not performant. KDE computations are expensive, mainly because the kernel spans the whole data space (the Gaussian kernel only reaches zero at infinity). Therefore, without a cut-off, every evaluation involves all observations in the dataset.
Changing the window function can drastically improve the performance. E.g., a triangular window has a fixed, finite span, which reduces the computation needed as the dataset grows in extent and size.
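As an illustration of that idea (not required for the rest of this answer; the bandwidth value below is arbitrary), scikit-learn's KernelDensity supports a compact-support 'linear' (triangular) kernel:

import numpy as np
from sklearn.neighbors import KernelDensity

data = np.random.normal(size=10_000)

# 'linear' is a triangular kernel with compact support, so each density query
# only involves nearby observations, unlike the Gaussian kernel which touches
# every point in the dataset at every evaluation.
kde = KernelDensity(kernel="linear", bandwidth=0.5).fit(data[:, None])

x = np.linspace(-4, 4, 200)
pdf = np.exp(kde.score_samples(x[:, None]))   # score_samples returns the log-density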
Implementation considerations
Reading the docs, it seems rv_continuous was primarily designed for implementing new continuous variables with analytical definitions.
Even so, the class provides automatic numerical resolution/integration for the other statistics whenever the underlying methods are not overridden.
When choosing this methodology, it is up to you to implement the missing methods if you wish to make it more performant and numerically robust.
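For instance, here is a sketch of how one might supply _cdf explicitly via gaussian_kde.integrate_box_1d (the class name and the Python-level loop are mine; this is one option, not the only one):

import numpy as np
from scipy import stats

class KDEDistWithCDF(stats.rv_continuous):
    # Same idea as KDEDist above, but with an explicit _cdf so that cdf, ppf
    # and rvs no longer fall back on generic numerical integration of the PDF.
    def __init__(self, kde, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._kde = kde

    def _pdf(self, x):
        return self._kde.pdf(x)

    def _cdf(self, x):
        # gaussian_kde can integrate itself exactly over an interval
        return np.array([self._kde.integrate_box_1d(-np.inf, xi)
                         for xi in np.atleast_1d(x)])

data = stats.norm(loc=0, scale=1).rvs(10**4, random_state=123)
X = KDEDistWithCDF(stats.gaussian_kde(data))
print(X.cdf(0.0))   # should be close to 0.5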
Histogram instead of KDE
If you can relax the KDE requirement and are satisfied with a histogram-based distribution, then you can rely on rv_histogram, which essentially does the same thing based on the binned distribution:
hist = np.histogram(data, bins=100)
hist_dist = stats.rv_histogram(hist)
stats.ks_2samp(data, hist_dist.rvs(size=100))
# KstestResult(statistic=0.0577, pvalue=0.8778871545532821)
Which returns:
KDE Histogram
Provided it is theoretically acceptable, we can mix both strategies by creating the expected histogram from the KDE:
hist = np.histogram(data, bins=1000)
hist_kde = kde.pdf(hist[1][:-1] + np.diff(hist[1])/2)  # evaluate the KDE at the bin centers
hist_dist_kde = stats.rv_histogram([hist_kde, hist[1]])
stats.ks_2samp(data, hist_dist_kde.rvs(size=100))
# KstestResult(statistic=0.1067, pvalue=0.19541766226890545)
Then the CDF is relatively smooth thanks to the KDE (although it is still a histogram underneath), and the continuous-variable object is as performant as rv_histogram can be.
I've looked at a bunch of examples on here and tried using snippets of other code, but they're not working for me. I have 4 data sets, but I'll include just one here. My professor told me that the data appeared to be Poisson distributed, so I am trying to fit a Poisson distribution to a histogram of the data. Here is my code:
######## Poisson fit ########
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.special import factorial
data = data59[4]  # data59 is loaded elsewhere
entries, bin_edges, patches = plt.hist(data, 60, [1, 10], density=True)
bin_middles = 0.5*(bin_edges[1:] + bin_edges[:-1])

def poisson(k, lamb):
    return np.exp(-lamb)*(lamb**k)/factorial(k)

popt, pcov = curve_fit(poisson, bin_middles, entries)
x = np.linspace(1,10,100)
plt.plot(x,poisson(x,*popt))
plt.show()
I tried plotting other distributions, like normal and Rayleigh, on top of the histogram using scipy.stats instead of curve_fit. Those only sort of worked because they have a scale parameter, which scipy.stats.poisson doesn't. The resulting curve comes out looking exactly the same as the curve_fit one. I'm not sure how to resolve this issue. Perhaps the data is not even Poisson distributed!
Thanks for helping!!
Update: The data is IceCube data from the TXS 0506+056 blazar. I used SkyDrive to get a URL for the file. I hope it works. The first column is the modified Julian day and the last column is the log of the energy proxy. I am using the last column. I have null and alternative hypotheses surrounding this data and am using maximum likelihood estimation (from a certain distribution, Poisson in my first case) to analyze the data.
Also, here is where I got the data: https://icecube.wisc.edu/science/data/TXS0506_point_source
The data presented in your histogram does not have a Poisson distribution. The Poisson is a counting distribution (what's the probability of getting 0, 1, 2, ... observations per unit of time or space), and its support is the non-negative integers. Your histogram clearly shows that you have fractional values, since the spikes have different non-zero heights at non-integer locations.
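As a quick sanity check, and to illustrate what genuine count data would allow (the uniform draw below is only a stand-in for the log-energy proxy column, not the real data):

import numpy as np

data = np.random.uniform(1, 10, 1000)    # stand-in for the log-energy proxy values
is_count_data = np.all(data >= 0) and np.allclose(data, np.round(data))
print(is_count_data)                     # False here, so a Poisson model is ruled out

counts = np.random.poisson(3.2, 1000)    # what genuine Poisson count data looks like
lamb_mle = counts.mean()                 # Poisson MLE: lambda-hat is the sample mean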
I have trouble using WAIC (widely applicable information criterion) in PyMC3. Namely, I have data which I know to be distributed according to multivariate Dirichlet distribution. I try to fit the data by assuming that marginal distributions are in one case the beta distributions, while in the other lognormal distributions. Obviously in the first case I get lower (better) WAIC value, than in the second case.
The problem arises in the third case, when I assume that the data is distributed according to a Dirichlet distribution. The third WAIC is significantly larger (worse) than in the first two cases. I would expect this WAIC to be lower (better) than the one I get in the second (log-normal) case.
Basically I want to show that log-normal fit is bad. This is easily seen by the naked eye, but I would like to have formal result to show.
The minimal code to replicate what I get:
import pandas as pd
import numpy as np
import pymc3 as pm
# generate the data
df=pd.DataFrame(np.random.dirichlet([10,10,10],size=2000))
# fit the first case (assuming beta marginal distributions)
betaModel=pm.Model()
with betaModel:
    alpha = pm.Uniform("alpha", lower=0, upper=20, shape=3)
    beta = pm.Uniform("beta", lower=0, upper=40, shape=3)
    observed = pm.Beta("obs", alpha=alpha, beta=beta, observed=df.values, shape=df.shape)
    betaTrace = pm.sample()
# fit the second case (assuming log-normal marginal distributions)
lognormalModel=pm.Model()
with lognormalModel:
    mu = pm.Normal("mu", mu=0, sd=3, shape=3)
    sd = pm.HalfNormal("sd", sd=3, shape=3)
    observed = pm.Lognormal("obs", mu=mu, sd=sd, observed=df.values, shape=df.shape)
    lognormalTrace = pm.sample()
# fit the third case (assuming Dirichlet multivariate distribution)
dirichletModel=pm.Model()
with dirichletModel:
    alpha = pm.HalfNormal("alpha", sd=3, shape=3)
    observed = pm.Dirichlet("obs", a=alpha, observed=df.values, shape=df.shape)
    dirichletTrace = pm.sample()
# compare WAIC
print(pm.waic(betaTrace,betaModel))
print(pm.waic(lognormalTrace,lognormalModel))
print(pm.waic(dirichletTrace,dirichletModel))
The output is:
WAIC_r(WAIC=-12801.95319823564, WAIC_se=105.07502476563575, p_WAIC=5.941977774190434)
WAIC_r(WAIC=-12534.643059697866, WAIC_se=115.43257184238044, p_WAIC=6.68850211472046)
WAIC_r(WAIC=-9156.050975326929, WAIC_se=81.45146980652675, p_WAIC=2.7977039523187996)
I guess the problem might be related to an error:
ValueError: operands could not be broadcast together with shapes (6000,) (2000,)
which I get when I try to run:
pm.compare((betaTrace,lognormalTrace,dirichletTrace),(betaModel,lognormalModel,dirichletModel))
Any suggestions how to obtain a reasonable comparison?
Edit
After thinking about the problem, I have come to believe that it is somewhat "improper". I tend to think so because WAIC is a relative measure, so it is likely that only similar statistical models can reasonably be compared. If the models are too dissimilar, then you get what I got.
The error I get from pm.compare seems to be related to how random vectors are treated. In the first two cases, each component of a random vector is treated as a separate random variate (3 components per 2000 vectors = 6000 points). In the third case, the random vector as a whole is treated as a single random variate (2000 vectors = 2000 points).
Initially I thought that this problem could be resolved by reducing the number of points in the first two cases. But as the first two statistical models (wrongly) assume that components are independent, adding log-probabilities does not change anything. WAIC values remain the same.
Currently I think that a small cheat is possible: namely, to fit the data to the Dirichlet distribution, but calculate WAIC as if I had fitted beta distributions. This gives the expected result - WAIC for the Dirichlet fit is slightly larger than WAIC for the beta fit, but smaller than WAIC for the log-normal fit.
The code for this "cheat":
from collections import namedtuple
from scipy.special import logsumexp
def cheat_logp(tracePoint, model):
    values = model.obs.eval()
    _, components = values.shape
    cb = [None]*components
    beta = np.sum(tracePoint["alpha"])
    for i in range(components):
        cheatBeta = pm.Beta.dist(alpha=tracePoint["alpha"][i], beta=beta - tracePoint["alpha"][i])
        cb[i] = cheatBeta.logp(values[:, i]).eval()
    return np.array(cb).T
def _log_post_trace(trace, model):
    # copy the contents of the _log_post_trace function from pymc3/stats.py
    # but replace "var.logp_elemwise(pt)" with "cheat_logp(pt, model)"
    # <...>

def mywaic(trace, model=None, pointwise=False):
    # copy the contents of the waic function from pymc3/stats.py
    # <...>
Obviously this cheat is not very "nice", and I am still very much interested in how to achieve similar results in a proper manner, if that is possible at all.
I want to generate a random (Gaussian) tensor that is symmetric with respect to all permutations of the axes. In the end I want all the entries to have the same distribution, so tricks like summing over all the permutations and rescaling by sqrt(k!), where k is the order of my tensor, don't work. E.g.:
import numpy as np
from itertools import permutations
# n is the tensor size, defined elsewhere
noise_buffer = np.random.normal(size=n*n*n).reshape(n, n, n)/np.sqrt(6)
noise = np.zeros([n, n, n])
for i in permutations([0, 1, 2]):
    noise += np.transpose(noise_buffer, axes=list(i))
I could loop over all the coordinates (-1) and rescale each entry appropriately, but this is time consuming.
Do you know any library where this is implemented? or do you know any fast implementation?
Actually, the method you present works; it just needs a small modification.
Use the fact that a sum of independent normal random variables is another normal random variable whose variance is the sum of the variances (see, e.g., here).
The problem, it seems, was that you want the entries of noise to come from a certain distribution, but if you take noise_buffer from that distribution, then noise will have a different distribution.
The solution is to use a different distribution for noise_buffer (specifically, a normal whose standard deviation is scaled by 1/sqrt(k!)); then noise will have the right distribution.
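A minimal sketch of that modification for a third-order tensor (k = 3; n = 10 is just an example size), checking the variance of the generic, all-distinct-index entries:

import numpy as np
from itertools import permutations
from math import factorial

k, n = 3, 10                             # tensor order and (example) size
scale = 1.0 / np.sqrt(factorial(k))      # shrink the buffer's std by sqrt(k!)

noise_buffer = np.random.normal(scale=scale, size=(n,) * k)
noise = np.zeros((n,) * k)
for perm in permutations(range(k)):
    noise += np.transpose(noise_buffer, axes=perm)

# An entry with all-distinct indices is a sum of k! independent N(0, 1/k!)
# terms, so its variance should be close to 1.
distinct = [(i, j, l) for i in range(n) for j in range(n) for l in range(n)
            if len({i, j, l}) == 3]
print(np.var([noise[t] for t in distinct]))   # roughly 1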
I have a set of raw data and I have to identify the distribution of that data. What is the easiest way to plot a probability distribution function? I have tried fitting it in normal distribution.
But I am more curious to know which distribution the data carries within itself.
I have no code to show my progress, as I have failed to find any functions in Python that would allow me to test the distribution of the dataset. I do not want to slice the data and force it to fit into, say, a normal or skew distribution.
Is there any way to determine the distribution of the dataset? Any suggestion appreciated.
Is this a correct approach? Example
This is close to what I am looking for, but again it fits the data to a normal distribution. Example
EDIT:
The input has million rows and the short sample is given below
Hashtag,Frequency
#Car,45
#photo,4
#movie,6
#life,1
The frequency ranges from 1 to 20,000 counts, and I am trying to identify the distribution of the frequencies of the keywords. I tried plotting a simple histogram, but I get the output as a single bar.
Code:
import pandas
import matplotlib.pyplot as plt
df = pandas.read_csv('Paris_random_hash.csv', sep=',')
plt.hist(df['Frequency'])
plt.show()
Output
This is a minimal working example for showing a histogram. It only solves part of your question, but it can be a step towards your goal. Note that the histogram function gives you the bin edges, and you have to interpolate to get the bin-center values.
import numpy as np
import matplotlib.pyplot as pl
x = np.random.randn(10000)
nbins = 20
n, bins = np.histogram(x, nbins, density=1)
pdfx = np.zeros(n.size)
pdfy = np.zeros(n.size)
for k in range(n.size):
    pdfx[k] = 0.5*(bins[k] + bins[k+1])
    pdfy[k] = n[k]
pl.plot(pdfx, pdfy)
You can fit your data using the example shown in:
Fitting empirical distribution to theoretical ones with Scipy (Python)?
Definitely a stats question - it sounds like you're trying to test whether your distribution is significantly similar to the normal, lognormal, binomial, etc. distributions. The easiest is to test for normal or lognormal, as explained below.
Set your p-value cutoff; usually, if your p-value <= 0.05, you reject the hypothesis that the data is normally distributed.
In Python, use SciPy. You just need the p-value returned for the test, so there are 2 return values from this function (I'm ignoring the optional inputs here for clarity):
import scipy.stats
W, Pvalue = scipy.stats.shapiro(x)  # x is your data array
Perform the Shapiro-Wilk test for normality. The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.
If you want to see whether it is lognormally distributed (provided it doesn't pass the test above), you can try:
import numpy
W, Pvalue = scipy.stats.shapiro(numpy.log(x))
Interpret it the same way - I just tested on a known lognormally distributed simulation and got a 0.17 p-value on the numpy.log(x) test, and a number close to 0 for the standard shapiro(x) test. That tells me the lognormal distribution is the better choice, while the normal distribution fails miserably.
I kept it simple, which is what I gathered you are looking for. For other distributions, you may need to do more work along the lines of Q-Q plots https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot rather than simply following the few tests I proposed. A Q-Q plot means you plot your data against the distribution you are trying to fit to. Here's a quick example that can get you down that path if you so desire:
import numpy as np
import pylab
import scipy.stats as stats
mydata = np.random.randn(1000)  # replace with whatever data you are looking to fit to a distribution
stats.probplot(mydata, dist='norm', plot=pylab)
pylab.show()
Above you can substitute anything for dist='norm' from the scipy library http://docs.scipy.org/doc/scipy/reference/tutorial/stats/continuous.html#continuous-distributions-in-scipy-stats
then find its scipy name (you must add shape parameters according to the documentation, such as stats.probplot(mydata, dist='loggamma', sparams=(1,1), plot=pylab) or, for Student's t, stats.probplot(mydata, dist='t', sparams=(1,), plot=pylab)), then look at the plot and see how closely your data follow that distribution. If the data points are close, you've found your distribution. The graph also gives you an R^2 value; the closer to 1, the better the fit, generally.
And if you want to continue trying to do what you're doing with the dataframe, try changing to: plt.hist(df['Frequency'].values)
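If you still get a single bar because of the heavy tail (frequencies from 1 to 20,000), log-spaced bins usually make the shape visible; a sketch assuming the same CSV as in your question (the bin count is arbitrary):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Paris_random_hash.csv', sep=',')
freq = df['Frequency'].values

bins = np.logspace(0, np.log10(freq.max()), 40)   # log-spaced bins for the 1..20,000 range
plt.hist(freq, bins=bins)
plt.xscale('log')
plt.xlabel('Frequency')
plt.ylabel('Number of hashtags')
plt.show()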
Did you try using the seaborn library? They have a nice kernel density estimation function. Try:
import seaborn as sns
sns.kdeplot(df['Frequency'])
You find installation instructions here
The only distribution the data carry within themselves is the empirical probability. If you have your data as a 1d numpy array data, you can compute the value of the empirical distribution function at x as the cumulative relative frequency of the values less than or equal to x:
data[data <= x].size / data.size
This is a step function, so it does not have an associated probability density function but rather a probability mass function, where the mass of each observed value is its relative frequency. To compute the relative frequencies:
values, freqs = np.unique(data, return_counts=True)
rfreqs = freqs / data.size
This does not mean that the data are a random sample from their empirical distribution. If you want to know what distribution your data are a sample from (if any) just by looking at the data, the answer is that you can't. But that is more about statistics than about programming.
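For completeness, a short sketch that plots the empirical CDF described above (the lognormal sample is just a stand-in for your data array):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.lognormal(size=1000)            # stand-in for your 1d array

xs = np.sort(data)                               # sorted observations
ecdf = np.arange(1, data.size + 1) / data.size   # step up by 1/n at each one
plt.step(xs, ecdf, where='post')
plt.xlabel('x')
plt.ylabel('empirical CDF')
plt.show()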
The histogram does not do what you think it does; you are trying to show a bar graph. The histogram needs each data point separately in a list, not the frequency itself. You have [3,2,0,4,...] but should have [1,1,1,2,2,4,4,4,4]. You cannot determine a probability distribution automatically.
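To illustrate the point: if all you have are counts per value, you can expand them back into individual observations with np.repeat before calling plt.hist (made-up numbers):

import numpy as np
import matplotlib.pyplot as plt

values = np.array([1, 2, 3, 4])       # the observed values
counts = np.array([3, 2, 0, 4])       # how many times each value occurred

samples = np.repeat(values, counts)   # -> [1, 1, 1, 2, 2, 4, 4, 4, 4]
plt.hist(samples, bins=range(1, 6))
plt.show()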
I think you are asking a slightly different question:
What is the correlation between my raw data and the curve to which I have mapped it?
This is a conceptual problem, and you're trying to understand the meanings of the values R and R squared. Start by working through this Minitab blog post. You may want to skim this non-Python KaleidaGraph guide to understand the classes of curves to fit and the use of least squares in fitting the curves.
You were probably downvoted because it is a math question more than a programming question.
I may be missing something, but it seems that a major point is being overlooked across the board: the data set you are describing is a categorical data set. That is, the x-values are not numeric; they're just words (#Car, #photo, etc.). The concept of the shape of a probability distribution is meaningless for a categorical data set, since there is no logical ordering for the categories. What would a histogram even look like? Would #Car be the first bin? Or would it be all the way to the right of your graph? Unless you have some criterion for quantifying your categories, trying to make judgments based on the shape of the distribution is meaningless.
Here's a small text-based example to clarify what I'm saying. Suppose I survey a group of people and ask their favorite color. I plot the results:
Red | ##
Green | #####
Blue | #######
Yellow | #####
Orange | ##
Huh, looks like color preferences are normally distributed. Wait, what if I had randomly put the colors in a different order in my graph:
Blue | #######
Yellow | #####
Green | #####
Orange | ##
Red | ##
I guess the data is actually positively skewed? Not so, of course - for a categorical data set the shape of the distribution is meaningless. Only if you were to decide to somehow quantify each hashtag in your data set would the problem have meaning. Do you want to compare the length of a hashtag to its frequency? Or the alphabetical ordering of a hashtag to its frequency? Etc.
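For example, if you chose hashtag length as the quantification, a minimal sketch (assuming the CSV from the question, with Hashtag and Frequency columns) could look like:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Paris_random_hash.csv', sep=',')   # the file from the question

df['Length'] = df['Hashtag'].str.len()               # one possible quantification
plt.scatter(df['Length'], df['Frequency'], alpha=0.3)
plt.xlabel('Hashtag length')
plt.ylabel('Frequency')
plt.show()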