I have a set of raw data and I have to identify the distribution of that data. What is the easiest way to plot a probability distribution function? I have tried fitting it to a normal distribution.
But I am more curious to know which distribution the data carries within itself.
I have no code to show my progress, as I have failed to find any functions in Python that will let me test the distribution of the dataset. I do not want to slice the data and force it to fit, say, a normal or skewed distribution.
Is there any way to determine the distribution of the dataset? Any suggestion is appreciated.
Is this a correct approach? Example
This is something close to what I am looking for, but again it fits the data to a normal distribution. Example
EDIT:
The input has a million rows and a short sample is given below:
Hashtag,Frequency
#Car,45
#photo,4
#movie,6
#life,1
The frequency ranges from 1 to 20,000 counts, and I am trying to identify the distribution of the frequency of the keywords. I tried plotting a simple histogram, but I get the output as a single bar.
Code:
import pandas
import matplotlib.pyplot as plt
df = pandas.read_csv('Paris_random_hash.csv', sep=',')
plt.hist(df['Frequency'])
plt.show()
Output
This is a minimal working example for showing a histogram as a curve. It only solves part of your question, but it can be a step towards your goal. Note that np.histogram gives you the bin edges, so you have to average adjacent edges to get each bin's center value.
import numpy as np
import matplotlib.pyplot as pl

x = np.random.randn(10000)
nbins = 20

# np.histogram returns the (normalized) bar heights and the bin edges
n, bins = np.histogram(x, nbins, density=True)

# average adjacent bin edges to get the bin centers
pdfx = np.zeros(n.size)
pdfy = np.zeros(n.size)
for k in range(n.size):
    pdfx[k] = 0.5*(bins[k] + bins[k+1])
    pdfy[k] = n[k]

pl.plot(pdfx, pdfy)
pl.show()
You can fit your data using the example shown in:
Fitting empirical distribution to theoretical ones with Scipy (Python)?
Definitely a stats question - it sounds like you're trying to test whether your data's distribution is significantly similar to the normal, lognormal, binomial, etc. distributions. The easiest is to test for normal or lognormal, as explained below.
Set your p-value cutoff, usually 0.05: if the returned p-value is <= 0.05, then the data is NOT normally distributed (you reject the null hypothesis of normality).
In Python, use SciPy. You just need the p-value for the test, so grab the two return values from this function (I'm ignoring the optional inputs here for clarity, since they're not needed):
import scipy.stats
W, Pvalue = scipy.stats.shapiro(x)  # x is your 1d array of data
Perform the Shapiro-Wilk test for normality. The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.
If you want to see if it is lognormally distributed (provided it doesn't pass the P test above), you can try:
import numpy
W, Pvalue = scipy.stats.shapiro(numpy.log(x))
Interpret it the same way: I just tested on a known lognormally distributed simulation and got a p-value of 0.17 on the numpy.log(x) test, and a number close to 0 for the standard shapiro(x) test. That tells me the lognormal is the better choice; the normal distribution fails miserably.
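A minimal, self-contained sketch of that comparison (the simulated lognormal sample is only a stand-in for your own data; scipy.stats.shapiro returns the W statistic and the p-value):

import numpy as np
import scipy.stats

x = np.random.lognormal(mean=0.0, sigma=1.0, size=500)   # stand-in for your data

W_norm, p_norm = scipy.stats.shapiro(x)                  # test x for normality
W_log, p_log = scipy.stats.shapiro(np.log(x))            # test log(x), i.e. x for lognormality

print("p-value, normal hypothesis:   ", p_norm)          # expected to be near 0 here
print("p-value, lognormal hypothesis:", p_log)           # expected to be well above 0.05 here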
I kept it simple, which is what I gathered you are looking for. For other distributions you may need to do more work along the lines of Q-Q plots (https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot) rather than simply following the few tests I proposed. That means you plot your data against the distribution you are trying to fit to. Here's a quick example that can get you down that path if you so desire:
import numpy as np
import pylab
import scipy.stats as stats

mydata = np.random.randn(1000)  # placeholder: whatever data you are looking to fit to a distribution
stats.probplot(mydata, dist='norm', plot=pylab)
pylab.show()
Above you can substitute anything for dist='norm' from the SciPy library (see http://docs.scipy.org/doc/scipy/reference/tutorial/stats/continuous.html#continuous-distributions-in-scipy-stats to find its SciPy name; you must add shape parameters according to the documentation, e.g. stats.probplot(mydata, dist='loggamma', sparams=(1, 1), plot=pylab), or for Student's t, stats.probplot(mydata, dist='t', sparams=(1,), plot=pylab)), then look at the plot and see how closely your data follow that distribution. If the data points are close, you've found your distribution. The graph also gives you an R^2 value; the closer to 1, the better the fit generally.
And if you want to continue trying to do what you're doing with the dataframe, try changing to: plt.hist(df['Frequency'].values)
Did you try using the seaborn library? They have a nice kernel density estimation function. Try:
import seaborn as sns
sns.kdeplot(df['Frequency'])
You can find installation instructions here.
The only distribution the data carry within themselves is the empirical distribution. If you have your data as a 1d NumPy array data, you can compute the value of the empirical distribution function at x as the cumulative relative frequency of the values less than or equal to x:
data[data <= x].size / data.size
This is a step function so it does not have an associated probability density function but a probability mass function where the mass of each observed value is its relative frequency. To compute the relative frequencies:
values, freqs = np.unique(data, return_counts=True)
rfreqs = freqs / data.size
This does not mean that the data is a random sample from their empirical distribution. If you want to know what distribution your data are a sample from (if any) just by looking at the data, the answer is you can't. But that is more about statistics than about programming.
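If you also want to visualize the empirical distribution function, here is a minimal sketch (assuming data is the 1d array from above; the random sample is only a stand-in):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(200)                     # stand-in for your 1d array of observations
xs = np.sort(data)
ecdf = np.arange(1, xs.size + 1) / xs.size      # cumulative relative frequencies
plt.step(xs, ecdf, where='post')
plt.xlabel('x')
plt.ylabel('empirical CDF')
plt.show()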
The histogram does not do what you think it does; you are trying to show a bar graph. The histogram needs each data point separately in a list, not the frequency itself. You have [3, 2, 0, 4, ...] but should have [1, 1, 1, 2, 2, 4, 4, 4, 4]. You cannot determine a probability distribution automatically.
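A minimal sketch of that expansion, using the example counts from the answer above and np.repeat (the values 1..4 here just stand in for whatever the counted items are):

import numpy as np
import matplotlib.pyplot as plt

counts = [3, 2, 0, 4]                                      # frequencies, as in the example above
values = np.repeat(np.arange(1, len(counts) + 1), counts)  # one entry per underlying observation
print(values)                                              # [1 1 1 2 2 4 4 4 4]
plt.hist(values, bins=len(counts))
plt.show()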
I think you are asking a slightly different question:
What is the correlation between my raw data and the curve to which I have mapped it?
This is a conceptual problem, and you're trying to understand the meanings of the values R and R-squared. Start by working through this Minitab blog post. You may want to skim this non-Python KaleidaGraph guide to understand the classes of curves to fit and the use of least-squares fitting.
You were probably downvoted because it is a math question more than a programming question.
I may be missing something, but it seems that a major point is being overlooked across the board: the data set you are describing is a categorical data set. That is, the x-values are not numeric; they're just words (#Car, #photo, etc.). The concept of the shape of a probability distribution is meaningless for a categorical data set, since there is no logical ordering for the categories. What would a histogram even look like? Would #Car be the first bin? Or would it be all the way to the right of your graph? Unless you have some criterion for quantifying your categories, trying to make judgments based on the shape of the distribution is meaningless.
Here's a small text-based example to clarify what I'm saying. Suppose I survey a group of people and ask their favorite color. I plot the results:
Red | ##
Green | #####
Blue | #######
Yellow | #####
Orange | ##
Huh, looks like color preferences are normally distributed. Wait, what if I had randomly put the colors in a different order in my graph:
Blue | #######
Yellow | #####
Green | #####
Orange | ##
Red | ##
I guess the data is actually positively skewed? Not so, of course - for a categorical data set the shape of the distribution is meaningless. Only if you were to decide to somehow quantify each hashtag in your data set would the problem have meaning. Do you want to compare the length of a hashtag to its frequency? Or the alphabetical ordering of a hashtag to its frequency? Etc.
Related
I would like to generate random numbers using a truncated Maxwell-Boltzmann distribution. I know that scipy has a built-in Maxwell random variable, but there is no truncated version of it (I am also aware of the truncated normal distribution, which is irrelevant here). I have tried to write my own random variable by subclassing rv_continuous:
import numpy as np
import scipy.stats as st
from scipy.special import erf

# v_0 and v_esc are assumed to be defined elsewhere (parameter values appear further below)
class maxwell_boltzmann_pdf(st.rv_continuous):
    def _pdf(self, x):
        n_0 = np.power(np.pi, 3/2)*np.square(v_0)*(v_0*erf(v_esc/v_0)-(2/np.sqrt(np.pi))*v_esc*np.exp(-np.square(v_esc/v_0)))
        return (1/n_0)*(4*np.pi*np.square(x))*np.exp(-np.square(x/v_0))*np.heaviside(v_esc-x, 0)

maxwell_boltzmann_cv = maxwell_boltzmann_pdf(a=0, b=v_esc, name='maxwell_boltzmann_pdf')
This does exactly what I want, but it is way too slow for my purpose (I am doing Monte Carlo simulations), even if I draw all the random velocities outside of all the loops. I have also thought of using the inverse transform sampling method, but the inverse of the CDF does not have an analytic form, and I would need to do a bisection for every number I draw. It would be great if there were a convenient way to generate random numbers from a truncated Maxwell-Boltzmann distribution with decent speed.
There are several things you can do here.
For fixed parameters v_esc and v_0, n_0 is a constant, so it doesn't need to be calculated in the _pdf method.
If you define only a PDF for a SciPy rv_continuous subclass, then the class's rvs, mean, and so on will be very slow, presumably because the method needs to integrate the PDF every time it needs to generate a random variate or calculate a statistic. If speed is at a premium, you will thus need to add an _rvs method to maxwell_boltzmann_pdf that uses its own sampler. (See also this question.) One possible method is rejection sampling: draw a point in a bounding box until it falls under the PDF. It works for any bounded PDF with a finite domain, as long as you know what the domain and bound are (the bound is the maximum value of f on that domain). See this question for example Python code.
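As a rough illustration of that rejection-sampling idea for this particular PDF (a sketch only, not the code from the linked question; v_0 and v_esc take the values used later in the thread, and the bound on the PDF is found numerically on a grid):

import numpy as np
from scipy.special import erf

v_0, v_esc = 220.0, 550.0          # parameter values used later in the thread
n_0 = np.power(np.pi, 3/2)*np.square(v_0)*(v_0*erf(v_esc/v_0)
      - (2/np.sqrt(np.pi))*v_esc*np.exp(-np.square(v_esc/v_0)))

def f_MB(v):
    # truncated Maxwell-Boltzmann PDF on [0, v_esc]
    return (1/n_0)*(4*np.pi*np.square(v))*np.exp(-np.square(v/v_0))*(v <= v_esc)

grid = np.linspace(0.0, v_esc, 2001)
bound = f_MB(grid).max()           # maximum of the PDF over its support

def sample_truncated_mb(n, rng=np.random.default_rng()):
    out = np.empty(0)
    while out.size < n:
        v = rng.uniform(0.0, v_esc, size=2*n)        # proposals drawn in the support
        u = rng.uniform(0.0, bound, size=2*n)        # heights drawn in the bounding box
        out = np.concatenate([out, v[u < f_MB(v)]])  # keep proposals that fall under the PDF
    return out[:n]

velocities = sample_truncated_mb(10000)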
If you know the distribution's CDF, then there are some additional tricks. One of them is the relatively new k-vector sampling method for sampling a continuous distribution. There are two phases: a setup phase and a sampling phase. The setup phase involves approximating the CDF's inverse via root finding, and the sampling phase uses this approximation to generate random numbers that follow the distribution in a very fast way without having to further evaluate the CDF. For a fixed distribution like this one, if you show me the CDF, I can precalculate the necessary data and the code needed to sample the distribution using that data. Essentially, the only non-trivial part of k-vector sampling is the root-finding step.
More information on sampling from an arbitrary distribution is found on my sampling methods page.
It turns out that there is a way to generate a truncated Maxwell-Boltzmann distribution with the inverse transform sampling method using the ppf feature of scipy. I am posting the code here for future reference.
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import erf
from scipy.stats import maxwell
# parameters
v_0 = 220 #km/s
v_esc = 550 #km/s
N = 10000
# CDF(v_esc)
cdf_v_esc = maxwell.cdf(v_esc,scale=v_0/np.sqrt(2))
# pdf for the distribution
def f_MB(v_mag):
    n_0 = np.power(np.pi,3/2)*np.square(v_0)*(v_0*erf(v_esc/v_0)-(2/np.sqrt(np.pi))*v_esc*np.exp(-np.square(v_esc/v_0)))
    return (1/n_0)*(4*np.pi*np.square(v_mag))*np.exp(-np.square(v_mag/v_0))*np.heaviside(v_esc-v_mag,0)
# plot the pdf
x = np.arange(600)
y = [f_MB(i) for i in x]
plt.plot(x,y,label='pdf')
# use inverse transform sampling to get the truncated Maxwell-Boltzmann distribution
sample = maxwell.ppf(np.random.rand(N)*cdf_v_esc,scale=v_0/np.sqrt(2))
# plot the histogram of the samples
plt.hist(sample,bins=100,histtype='step',density=True,label='histogram')
plt.xlabel('v_mag')
plt.legend()
plt.show()
This code generates the required random variables and compares their histogram with the analytic form of the pdf; the two match each other pretty well.
I've looked at a bunch of examples on here and tried using snippets of other code, but they're not working for me. I have 4 data sets, but I'll include just one here. My professor told me that the data appeared to be Poisson distributed, so I am trying to fit a Poisson distribution to a histogram of the data. Here is my code:
######## Poisson fit ########
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.special import factorial
data = data59[4]
# 60 bins over the range [1, 10]; density=True normalizes the histogram to unit area
entries, bin_edges, patches = plt.hist(data, bins=60, range=[1, 10], density=True)
bin_middles = 0.5*(bin_edges[1:] + bin_edges[:-1])

def poisson(k, lamb):
    return np.exp(-lamb)*(lamb**k)/factorial(k)

popt, pcov = curve_fit(poisson, bin_middles, entries)
x = np.linspace(1, 10, 100)
plt.plot(x, poisson(x, *popt))
plt.show()
I tried plotting other distributions on top of the histogram, like normal and Rayleigh, using scipy.stats instead of curve_fit. Those kind of worked, but only because they have a scale parameter, which scipy.stats.poisson doesn't. The resulting curve comes out looking exactly the same as the curve_fit one. I'm not sure how to resolve this issue. Perhaps the data is not even Poisson distributed!
Thanks for helping!!
Update: The data is IceCube data from the TXS 0506+056 blazar. I used SkyDrive to get a URL for the file. I hope it works. The first column is the modified Julian day and the last column is the log of the energy proxy. I am using the last column. I have null and alternative hypotheses surrounding this data and am using maximum likelihood estimation (from a certain distribution, Poisson in my first case) to analyze the data.
Also, here is where I got the data: https://icecube.wisc.edu/science/data/TXS0506_point_source
The data presented in your histogram does not have a Poisson distribution. The Poisson is a counting distribution (what's the probability of getting 0, 1, 2, ... observations per unit of time or space), and its support is the non-negative integers. Your histogram clearly shows that you have fractional values, since the spikes have different non-zero heights at non-integer locations.
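A quick way to check this programmatically (a sketch, with a small stand-in array in place of the column from the data file): if any values are fractional, a Poisson model cannot be right.

import numpy as np

data = np.array([2.3, 4.1, 3.0, 5.7])                  # stand-in for the log energy-proxy column
integer_valued = np.all(np.mod(data, 1) == 0)
print("all values are integers (Poisson at least possible):", integer_valued)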
I have trouble using WAIC (widely applicable information criterion) in PyMC3. Namely, I have data which I know to be distributed according to a multivariate Dirichlet distribution. I try to fit the data by assuming, in one case, that the marginal distributions are beta distributions and, in the other, that they are lognormal distributions. Obviously, in the first case I get a lower (better) WAIC value than in the second case.
The problem arises in the third case, when I assume that the data is distributed according to a Dirichlet distribution. The third WAIC is significantly larger (worse) than in the first two cases. I would expect this WAIC to be lower (better) than the one I get in the second (lognormal) case.
Basically, I want to show that the lognormal fit is bad. This is easily seen by the naked eye, but I would like to have a formal result to show.
The minimal code to replicate what I get:
import pandas as pd
import numpy as np
import pymc3 as pm
# generate the data
df=pd.DataFrame(np.random.dirichlet([10,10,10],size=2000))
# fit the first case (assuming beta marginal distributions)
betaModel = pm.Model()
with betaModel:
    alpha = pm.Uniform("alpha", lower=0, upper=20, shape=3)
    beta = pm.Uniform("beta", lower=0, upper=40, shape=3)
    observed = pm.Beta("obs", alpha=alpha, beta=beta, observed=df.values, shape=df.shape)
    betaTrace = pm.sample()

# fit the second case (assuming log-normal marginal distributions)
lognormalModel = pm.Model()
with lognormalModel:
    mu = pm.Normal("mu", mu=0, sd=3, shape=3)
    sd = pm.HalfNormal("sd", sd=3, shape=3)
    observed = pm.Lognormal("obs", mu=mu, sd=sd, observed=df.values, shape=df.shape)
    lognormalTrace = pm.sample()

# fit the third case (assuming Dirichlet multivariate distribution)
dirichletModel = pm.Model()
with dirichletModel:
    alpha = pm.HalfNormal("alpha", sd=3, shape=3)
    observed = pm.Dirichlet("obs", a=alpha, observed=df.values, shape=df.shape)
    dirichletTrace = pm.sample()

# compare WAIC
print(pm.waic(betaTrace, betaModel))
print(pm.waic(lognormalTrace, lognormalModel))
print(pm.waic(dirichletTrace, dirichletModel))
The output is:
WAIC_r(WAIC=-12801.95319823564, WAIC_se=105.07502476563575, p_WAIC=5.941977774190434)
WAIC_r(WAIC=-12534.643059697866, WAIC_se=115.43257184238044, p_WAIC=6.68850211472046)
WAIC_r(WAIC=-9156.050975326929, WAIC_se=81.45146980652675, p_WAIC=2.7977039523187996)
I guess the problem might be related to an error:
ValueError: operands could not be broadcast together with shapes (6000,) (2000,)
which I get when I try to run:
pm.compare((betaTrace,lognormalTrace,dirichletTrace),(betaModel,lognormalModel,dirichletModel))
Any suggestions how to obtain a reasonable comparison?
Edit
After thinking about the problem, I have come to believe that the comparison is somewhat "improper". I tend to think so because WAIC is a relative measure, so it is likely that only similar statistical models can be reasonably compared. If the models are too dissimilar, then you get what I got.
The error I get from pm.compare seems to be related to how random vectors are treated. In the first two cases each component of a random vector is treated as a separate random variate (3 components per 2000 vectors = 6000 points). In the third case the random vector as a whole is treated as a single random variate (2000 vectors = 2000 points).
Initially I thought that this problem could be resolved by reducing the number of points in the first two cases. But as the first two statistical models (wrongly) assume that the components are independent, adding log-probabilities does not change anything; the WAIC values remain the same.
Currently I think that a small cheat is possible: fit the data to the Dirichlet distribution, but calculate WAIC as if I had fitted the beta distribution. This gives the expected result - WAIC for the Dirichlet fit is slightly larger than WAIC for the beta fit, but smaller than WAIC for the lognormal fit.
The code for this "cheat":
from collections import namedtuple
from scipy.special import logsumexp

def cheat_logp(tracePoint, model):
    values = model.obs.eval()
    _, components = values.shape
    cb = [None]*components
    beta = np.sum(tracePoint["alpha"])
    for i in range(components):
        cheatBeta = pm.Beta.dist(alpha=tracePoint["alpha"][i], beta=beta-tracePoint["alpha"][i])
        cb[i] = cheatBeta.logp(values[:, i]).eval()
    return np.array(cb).T
def _log_post_trace(trace, model):
    # copy the contents of the _log_post_trace function from pymc3/stats.py
    # but replace "var.logp_elemwise(pt)" with "cheat_logp(pt, model)"
    # <...>

def mywaic(trace, model=None, pointwise=False):
    # copy the contents of the waic function from pymc3/stats.py
    # <...>
Obviously this cheat is not very "nice", and I am still very much interested in how to achieve similar results in a proper manner, if that is possible at all.
I have values sampled from a continuous distribution, for example:
import numpy as np
values = np.random.normal(loc=0.4, scale=0.1, size=1000)
How can I estimate the mode based on those values?
The mean and median are easy to compute: np.mean(values) and np.median(values); but for the mode I don't know how to estimate it, since the values are continuous.
Note that using something like scipy.stats.mode would not work, because I have a finite set of values sampled from a continuous distribution.
If you have a known, underlying parametric model, life is easy. Fit your data (using MLE or whatever) and take the mode of the fitted distribution. If you don't have a good parametric model, life is harder. There are a number of things I've seen in the literature, but I don't know if any sort of consensus has been reached on this. When I had to do this (~20 years ago) I used an algorithm I found in Numerical Recipes in C. I have no idea if that was the best choice or not.
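A minimal sketch of both routes, using the question's simulated sample (the non-parametric branch takes the peak of a Gaussian KDE on a grid, which is one common choice rather than the specific algorithm mentioned above):

import numpy as np
from scipy import stats

values = np.random.normal(loc=0.4, scale=0.1, size=1000)

# parametric: fit a normal by maximum likelihood; its mode is the fitted location
loc, scale = stats.norm.fit(values)
parametric_mode = loc

# non-parametric: evaluate a kernel density estimate on a grid and take its peak
kde = stats.gaussian_kde(values)
grid = np.linspace(values.min(), values.max(), 1000)
kde_mode = grid[np.argmax(kde(grid))]

print(parametric_mode, kde_mode)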
For my thesis, I am trying to identify outliers in my data set. The data set consists of 160,000 measurements of one variable from a real process environment. In this environment, however, there can be measurements that are not actual data from the process itself but simply junk data. I would like to filter them out with a little help from the literature instead of relying only on "expert opinion".
Now, I've read about the IQR method for seeing where possible outliers lie when dealing with a symmetric distribution like the normal distribution. However, my data set is right-skewed, and by distribution fitting, the inverse gamma and lognormal were the best fits.
So, during my search for methods for non-symmetric distributions, I found this topic on Cross Validated, where user603's answer is particularly interesting: Is there a boxplot variant for Poisson distributed data?
In user603's answer, he states that an adjusted boxplot helps to identify possible outliers in your dataset, and that R and Matlab have functions for this:
There is an R implementation of this (robustbase::adjbox()) as well as a matlab one (in a library called libra).
I was wondering if there is such a function in Python, or if there is a way to calculate the medcouple (see the paper in user603's answer) with Python.
I really would like to see what comes out of the adjusted boxplot for my data.
In the module statsmodels.stats.stattools there is a function medcouple(), which is the measure of skewness used in the adjusted boxplot.
With this value you can calculate the interval beyond which points are flagged as outliers, as sketched below.
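For example, a hedged sketch of those fences, following the adjusted-boxplot rule of Hubert and Vandervieren (please verify the exponents against the paper before relying on them; the lognormal sample is only a stand-in for your skewed process data):

import numpy as np
from statsmodels.stats.stattools import medcouple

data = np.random.lognormal(size=1000)        # stand-in for the right-skewed process data
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mc = medcouple(data)                         # robust skewness measure

# adjusted-boxplot fences (Hubert & Vandervieren, 2008)
if mc >= 0:
    lower = q1 - 1.5*np.exp(-4*mc)*iqr
    upper = q3 + 1.5*np.exp(3*mc)*iqr
else:
    lower = q1 - 1.5*np.exp(-3*mc)*iqr
    upper = q3 + 1.5*np.exp(4*mc)*iqr

outliers = data[(data < lower) | (data > upper)]
print(len(outliers), "possible outliers flagged")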