Long tail distribution of random numbers in Python

I need to make a randomizing function in Python returning values using long tail distribution. Unfortunately, my math skills are nowhere near my programming skills so I'm stuck.
This is the kind of distribution I'm looking for:
[image of a long-tailed distribution curve; source: danvk.org]
Returned value must be between 0 and 1, and it must be possible to assign a peak value (where the graph peaks on the Y axis), which would be a number between 0 and 1.
Example usage:
def random_long_tail(peak):
    # magic
value = random_long_tail(0.2)
print(value)  # outputs e.g. 0.345811242
I will be incredibly grateful for any help in solving this issue. Thank you!

There are quite a few distributions with a single peak and some tail: log-normal, gamma, and chi-squared, to name a few.
Typically, you would look at numpy's random module to see what's available and how the distributions fit your problem. Link: http://docs.scipy.org/doc/numpy-1.10.1/reference/routines.random.html

You might look into NumPy's random sampling options and see which ones are on the list of common heavy-tailed distributions.
The log-normal is a nice example. NumPy lets you specify the mean and standard deviation of the underlying normal. You will have to do a bit of algebra to "assign a peak value"; that peak is called the mode.
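As a rough sketch of that idea (the sigma value and the rejection step for keeping results under 1 are my own assumptions, not part of the question):

import numpy as np

def random_long_tail(peak, sigma=0.5):
    # For a log-normal with parameters (mu, sigma), the mode of the density
    # is exp(mu - sigma**2), so this choice of mu places the peak at `peak`.
    mu = np.log(peak) + sigma ** 2
    while True:
        value = np.random.lognormal(mean=mu, sigma=sigma)
        if value < 1.0:  # crude rejection step to keep the result in (0, 1)
            return value

print(random_long_tail(0.2))  # e.g. 0.345...

Tuning sigma trades off how heavy the tail is against how often the rejection step has to resample.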

Related

Python: How to discretize continuous probability distributions for Kullback-Leibler Divergence

I want to find out how many samples are needed, at minimum, to more or less correctly fit a probability distribution (in my case the Generalized Extreme Value distribution from scipy.stats).
In order to evaluate the fitted function, I want to compute the KL divergence between the original distribution and the fitted one.
Unfortunately, all implementations I found (e.g. scipy.stats.entropy) only take discrete arrays as input. So obviously I thought of approximating the pdf by a discrete array, but I just can't seem to figure it out.
Does anyone have experience with something similar? I would be thankful for hints relating directly to my question, but also for better alternatives to determine a distance between two functions in python, if there are any.
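Something along these lines is what I had in mind, a sketch of the discretization idea (the GEV parameters, the grid, and the small floor on the fitted pdf are just placeholder assumptions):

import numpy as np
from scipy import stats

true_dist = stats.genextreme(c=-0.1, loc=0.0, scale=1.0)
samples = true_dist.rvs(size=500)
fitted_dist = stats.genextreme(*stats.genextreme.fit(samples))

grid = np.linspace(-5, 20, 2000)                  # common discretization grid
p = true_dist.pdf(grid)
q = np.clip(fitted_dist.pdf(grid), 1e-12, None)   # floor avoids division by zero

print(stats.entropy(p, q))                        # KL(p || q); entropy() normalizes the arrays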

Total Variation Distance for continuous distributions in Python (or R)

I would like to calculate the total variation distance (TVD) between two continuous probability distributions. I would like to point out that while there are two relevant questions (see here and here), they are both working with discrete distributions.
For those not familiar with TVD,
Informally, this is the largest possible difference between the
probabilities that the two probability distributions can assign to the
same event.
as it is described in the respective Wikipedia page. In the case of continuous distributions, TVD is equal to half the integral of the absolute difference between the two densities, i.e. TVD(P, Q) = 0.5 * ∫ |p(x) − q(x)| dx (since I cannot add math notation, see this for a proof and for the notation).
So far, I haven't been able to find a tool for this job in Python. I would be interested in one if it exists. Also, while I have no experience in R, I understand it is commonly used for such tasks, so I would be interested in an R tool as well (the TVD calculation is the final step of my algorithm, so I guess it won't be hard to read some data from a file, do the calculation and print a number even if I am completely new to R).
I would like to add that I am mainly interested in normal distributions, so a tool strictly for those is more than welcome.
If no such tools exist, then any help adapting answers from this question to use the builtin probability functions will be of great help as well.
Thank you in advance.
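For illustration, here is a minimal sketch of the half-integral formula above for two normal distributions (the parameter values are placeholders); I am hoping for something more established than this:

import numpy as np
from scipy import stats
from scipy.integrate import quad

p = stats.norm(loc=0.0, scale=1.0)
q = stats.norm(loc=1.0, scale=2.0)

# TVD = 0.5 * integral of |p(x) - q(x)| dx over the real line
tvd, abs_err = quad(lambda x: abs(p.pdf(x) - q.pdf(x)), -np.inf, np.inf)
print(0.5 * tvd)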

How do you compute the expected value of an infinite distribution? (particularly in python)

I was trying to compute the expected value of a distribution (assume I know the parameters or can estimate them), but it might be a distribution over an infinite sample space. Is there a library (for example in Python, numpy or something) that is able to compute such an expected value with reasonable speed and accuracy?
For an arbitrary distribution this seemed hard, but the only thought I had was: if it were, say, normal, we could approximate the sum or integral by adding up small enough chunks over the region where the probability is highly concentrated... but I wanted to do something less ad hoc and more established, since I am sure I am not the first one to try to compute an expected value on a computer.
Having a probability space with infinite support is not uncommon.
The normal and t distributions have support over the whole real line; the Poisson distribution is over all non-negative integers.
The distributions in scipy.stats implement an expect method, which in the continuous case just uses scipy.integrate.quad, and in the discrete case uses an expanding summation with a heuristic stopping criterion.
This works quite well with well-behaved functions but can run into problems in some cases, like shifted support of the function or fat tails.
Variance of the standard normal:
>>> from scipy import stats
>>> stats.norm.expect(lambda x: x**2)
1.000000000000001
Variance of a Poisson distribution with mean 5:
>>> stats.poisson.expect(lambda x: (x - 5)**2, args=(5,))
4.9999999999999973

python scipy.stats pdf and expect functions

I was wondering if someone could please explain what the following functions in scipy.stats do:
rv_continuous.expect
rv_continuous.pdf
I have read the documentation but I am still confused.
Here is my task, quite simple in theory, but I am still confused with what these functions do.
So, I have a list of areas, 16383 values. I want to find the probability that the variable area takes any value between a smaller value, called "inf", and a larger value, "sup".
So, what I thought is:
scipy.stats.rv_continuous.pdf(a) #a being the list of areas
scipy.stats.rv_continuous.expect(pdf, lb = inf, ub = sup)
So that I can get the probability that any area is between inf and sup.
Can anyone help me by explaining in a simple way what the functions do and any hint on how to compute the integral of f(a) between inf and sup, please?
Thanks
Blaise
rv_continuous is a base class for all of the probability distributions implemented in scipy.stats. You would not call methods on rv_continuous yourself.
Your question is not entirely clear about what you want to do, so I will assume that you have an array of 16383 data points drawn from some unknown probability distribution. From the raw data, you will need to estimate the cumulative distribution, evaluate that cumulative distribution at the sup and inf values, and subtract to find the probability that a value drawn from the unknown distribution lies between them.
There are lots of ways to estimate the unknown distribution from the data depending on how much modelling you want to do and how many assumptions you want to make. At the more complicated end of the spectrum, you could try to fit one of the standard parametric probability distributions to the data. For example, if you had a suspicion that your data were lognormally distributed, you could use scipy.stats.lognorm.fit(data, floc=0) to find the parameters of the lognormal distribution that fit your data. Then you could use scipy.stats.lognorm.cdf(sup, *params) - scipy.stats.lognorm.cdf(inf, *params) to estimate the probability of the value being between those values.
In the middle are the non-parametric forms of distribution estimation like histograms and kernel density estimates. For example, scipy.stats.gaussian_kde(data).integrate_box_1d(inf, sup) is an easy way to make this estimate using a Gaussian kernel density estimate of the unknown distribution. However, kernel density estimates aren't always appropriate and require some tweaking to get right.
The simplest thing you could do is just count the number of data points that fall between inf and sup and divide by the total number of data points that you have. This only works well with a largish number of points (which you have) and with bounds that aren't too far in the tails of the data.
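A minimal sketch of the simplest (counting) estimate and the KDE estimate, with placeholder names for your data file and bounds:

import numpy as np
from scipy import stats

areas = np.loadtxt("areas.txt")   # the 16383 values, assumed one per line
inf, sup = 10.0, 50.0             # placeholder bounds

# Empirical estimate: fraction of data points that fall in [inf, sup].
p_empirical = np.mean((areas >= inf) & (areas <= sup))

# Gaussian kernel density estimate of the same probability.
kde = stats.gaussian_kde(areas)
p_kde = kde.integrate_box_1d(inf, sup)

print(p_empirical, p_kde)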
The cumulative distribution function (CDF) might give you what you want.
Then the probability P of being between two values is
P(inf < area < sup) = cdf(sup) - cdf(inf)
There's a tutorial about probabilities here and here
They are all related. The pdf is the "density" of the probabilities: it must be non-negative and integrate (or, in the discrete case, sum) to 1. I think of it as indicating how likely something is. The expectation is a generalisation of the idea of an average:
E[X] = sum over x of x * P(x)
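A tiny sketch of the cdf-difference formula, assuming (purely for illustration) that the areas are roughly normal, with placeholder names for the data file and bounds:

import numpy as np
from scipy import stats

areas = np.loadtxt("areas.txt")   # placeholder for your data
inf, sup = 10.0, 50.0             # placeholder bounds

mu, sigma = stats.norm.fit(areas)
p = stats.norm.cdf(sup, loc=mu, scale=sigma) - stats.norm.cdf(inf, loc=mu, scale=sigma)
print(p)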

Find a random method that best fit list of values

I have a list of many float numbers, representing the duration of an operation performed several times.
For each type of operation, I have a different trend in numbers.
I'm aware of many random generators presented in some python modules, like in numpy.random
For example, there are binomial, exponential, normal, Weibull, and so on...
I'd like to know if there's a way to find, given a list of values, the random generator that best fits that list of numbers.
I.e., the generator (with its params) that best fits the trend of the numbers in the list.
That's because I'd like to automate the generation of time lengths for each operation, so that I can simulate it over n years, without having to find by hand which method fits which list of numbers best.
EDIT: In other words, trying to clarify the problem:
I have a list of numbers. I'm trying to find the probability distribution that best fits the array of numbers I already have. The only problem I see is that each probability distribution has input params that may affect the result. So I'll have to figure out how to choose these params automatically so they best fit the list.
Any idea?
You might find it better to think about this in terms of probability distributions, rather than thinking about random number generators. You can then think in terms of testing goodness of fit for your different distributions.
As a starting point, you might try constructing probability plots for your samples. Probably the easiest in terms of the math behind it would be to consider a Q-Q plot. Using the random number generators, create a sample of the same size as your data. Sort both of these, and plot them against one another. If the distributions are the same, then you should get a straight line.
Edit: To find appropriate parameters for a statistical model, maximum likelihood estimation is a standard approach. Depending on how many samples of numbers you have and the precision you require, you may well find that just playing with the parameters by hand will give you a "good enough" solution.
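A rough sketch of the Q-Q plot idea above against an exponential candidate (the candidate family and the file name are placeholders):

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt("durations.txt")                                  # placeholder data file
candidate = np.random.exponential(scale=np.mean(data), size=len(data))

plt.plot(np.sort(candidate), np.sort(data), "o")                    # Q-Q scatter
plt.plot([data.min(), data.max()], [data.min(), data.max()], "--")  # reference line
plt.xlabel("candidate quantiles")
plt.ylabel("data quantiles")
plt.show()

If the points hug the reference line, the candidate distribution is a reasonable fit.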
Why using random numbers for this is a bad idea has already been explained. It seems to me that what you really need is to fit the distributions you mentioned to your points (for example, with a least squares fit), then check which one fits the points best (for example, with a chi-squared test).
EDIT Adding reference to numpy least squares fitting example
Given a parameterized univariate distribution (e.g. exponential depends on lambda, or gamma depends on theta and k), the way to find the parameter values that best fit a given sample of numbers is called maximum likelihood estimation. It is not a least squares procedure, which would require binning and thus losing information! Some Wikipedia distribution articles give expressions for the maximum likelihood estimates of the parameters, but many do not, and even the ones that do are missing expressions for error bars and covariances. If you know calculus, you can derive these results by expressing the log likelihood of your data set in terms of the parameters, setting its first derivative to zero to maximize it, and using the inverse of the negative curvature (Hessian) matrix at the maximum as the covariance matrix of your parameters.
Given two different fits to two different parameterized distributions, the way to compare them is called the likelihood ratio test. Basically, you just pick the one with the larger log likelihood.
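A rough sketch of that procedure using scipy's built-in fit() (which performs maximum likelihood) and then comparing log likelihoods; the candidate families and the file name are placeholders:

import numpy as np
from scipy import stats

data = np.loadtxt("durations.txt")          # placeholder data file

for name, dist in [("gamma", stats.gamma), ("lognorm", stats.lognorm), ("expon", stats.expon)]:
    params = dist.fit(data)                 # maximum likelihood estimates
    loglik = np.sum(dist.logpdf(data, *params))
    print(name, params, loglik)             # pick the family with the largest log likelihood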
Gabriel, if you have access to Mathematica, parameter estimation is built in:
In[43]:= data = RandomReal[ExponentialDistribution[1], 10]
Out[43]= {1.55598, 0.375999, 0.0878202, 1.58705, 0.874423, 2.17905, \
0.247473, 0.599993, 0.404341, 0.31505}
In[44]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MaximumLikelihood"]
Out[44]= ExponentialDistribution[1.21548]
In[45]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MethodOfMoments"]
Out[45]= ExponentialDistribution[1.21548]
However, it is easy to figure out what the maximum likelihood method requires the parameter to be.
In[48]:= Simplify[
D[LogLikelihood[ExponentialDistribution[la], {x}], la], x > 0]
Out[48]= 1/la - x
Hence the maximum likelihood estimate for the exponential distribution solves sum(1/la - x_i) = 0, from which la = 1/Mean[data]. Similar equations can be worked out for other distribution families and coded in the language of your choice.
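For example, the same closed-form result in Python (the sample data is just a placeholder):

import numpy as np

data = np.random.exponential(scale=1.0, size=10)   # placeholder sample
la_hat = 1.0 / np.mean(data)                       # maximum likelihood estimate of the rate
print(la_hat)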
