Random integers from an exponential distribution between min and max - python

I would like to generate random integers on an interval min to max. For a uniform distribution in numpy:
numpy.random.randint(min,max,n)
does exactly what I want.
However, I would now like to give the distribution of random numbers an exponential bias. There are a number of suggestions for this, e.g. Pseudorandom Number Generator - Exponential Distribution, as well as the numpy function numpy.random.RandomState.exponential, but these do not address how to constrain the distribution to integers between min and max. I'm not sure how to do this while still ensuring a random distribution.

The exponential distribution is a continuous distribution. What you probably want is its discrete equivalent, the geometric distribution. NumPy's implementation generates strictly positive integers, i.e., 1, 2, 3, ..., so you'll want to add min-1 to shift it, and then truncate by rejecting/throwing away results > max. That, in turn, means generating them one-by-one and adding the non-rejected values to a list until you get the desired number. (You could also determine analytically what proportion you expect to be rejected and scale your n accordingly, but you'll still likely end up a few short or with a few too many.)
It's possible to do this without rejection, but you'd have to create your own inversion, determine the probability of exceeding max, and generate uniforms between 0 and that probability to feed to your inversion algorithm. Rejection is simpler even though it's less efficient.
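A minimal sketch of the rejection approach, with the caveat that the geometric rate p, the seed argument, and the helper name truncated_geometric are choices of mine rather than anything from the question:

import numpy as np

def truncated_geometric(low, high, n, p=0.3, seed=None):
    # Draw n integers in [low, high] with a geometric (discretized exponential)
    # bias toward low, rejecting anything that lands above high.
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        # geometric() returns 1, 2, 3, ...; shift so the smallest value is low
        draw = rng.geometric(p) + low - 1
        if draw <= high:  # keep only values inside the interval
            out.append(draw)
    return np.array(out)

samples = truncated_geometric(1, 10, 1000)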

Maybe you can try summing up all the biases. Then the probability of generating an integer j is bias(j) / total bias. You can use Monte Carlo simulation to implement this.
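One way to implement that idea with NumPy might look like the sketch below; the exponential weight function and the rate lam are assumptions, not something specified in the question:

import numpy as np

lo, hi, n = 1, 10, 1000
values = np.arange(lo, hi + 1)

# Exponentially decaying bias for each integer; lam is an assumed rate.
lam = 0.5
bias = np.exp(-lam * (values - lo))
probs = bias / bias.sum()  # probability of j = bias of j / total bias

rng = np.random.default_rng()
samples = rng.choice(values, size=n, p=probs)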

Related

Probability of getting 0.0 from random()?

I'm trying to make a simple program to demonstrate something, though I'm a bit perplexed by the math of it.
from random import random
a = random()
I read up on the random function: its output range is [0.0, 1.0). It uses the Mersenne Twister to generate pseudorandom numbers, and the result is a 56-bit precision floating-point number.
I'm assuming that means that the probability of it generating exactly 0.0 is 1/2^56?
What would a have to be lower than in order for the probability to be 1/2^28? I tried understanding the 56-bit float conversion but I can't seem to figure it out. What would the actual float value have to be?
a = ?
if random() < a:
print("Success")
With a continuous uniform distribution over [0, 1), the portion of samples less than x is x. For example, ½ the samples are less than ½. So the x such that the probability that a sample is less than x is 1/2^28 is 1/2^28.
With a quantized distribution (only multiples of a certain quantum are in the distribution) over [0, 1), the same is true if x is a number in the distribution. If x lies between two numbers in the distribution, the probability that a sample is less than x equals the number in the distribution just greater than x. However, in the situation you describe, it seems like 1/2^28 is in the distribution, and so it is the answer.
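For example, assuming random() returns multiples of 2^-53 (the double-precision model described in the next answer), a quick arithmetic check shows that 1/2^28 is in the distribution and that exactly the right fraction of outputs falls below it:

quantum = 2.0 ** -53
a = 1.0 / 2 ** 28

# a is an exact multiple of the quantum, so it is "in the distribution" ...
print(a % quantum == 0)  # True

# ... and the fraction of the 2**53 possible outputs below it is exactly 1/2**28.
print((a / quantum) / 2 ** 53 == 2.0 ** -28)  # True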
It depends on how it was generated. Almost all libraries use the equidistant method to produce values on [0,1). Briefly:
Generate uniform integer (say 64-bits returned per call)
Throw away the number of excess bits to match the floating-point precision (24 for single, 53 for doubles)
Convert the integer to float (no rounding occurs since the value "fits") and scale into the range by multiplying by 2^-24 (single) or 2^-53 (double)
So (taking doubles) the method produces 2^53 unique FP values and each occurs with probability 2^-53. The number of bits of the underlying integer generator doesn't affect this.
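A small illustration of that recipe in pure Python, using getrandbits as a stand-in for the underlying 64-bit integer generator (a sketch of the method, not the library's actual code path):

import random

def uniform_double():
    # Equidistant method: take 53 random bits and scale by 2**-53,
    # producing one of 2**53 equally likely doubles in [0, 1).
    bits = random.getrandbits(64)  # uniform 64-bit integer from the generator
    bits >>= 64 - 53               # throw away the 11 excess bits
    return bits * 2.0 ** -53       # exact int-to-float conversion, then scale

print(uniform_double())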

Why does numpy.random.normal give some negative values in the ndarray?

I know that the normal distribution is always greater than 0 for any chosen value of the mean and the standard deviation.
>>> np.random.normal(scale=0.3, size=x.shape)
[ 0.15038925 -0.34161875 -0.07159422 0.41803414 0.39900799 0.10714512
0.5770597 -0.16351734 0.00962916 0.03901677]
Here the mean is 0.0 and the standard deviation is 0.3. But some values in the ndarray are negative. Am I wrong in my interpretation that normal distribution curve is always positive?
Edit:
But using normpdf function in matlab always give an array of positive values which I guess is the probability density function (y axis). Whereas numpy.random.normal gives both positive and negative values (x axis). Now this is confusing.
Values generated from a normal distribution do take negative values.
For example, for a normal distribution with mean 0, we need some positive values and some negative values for the average to come out to zero. Also, for the normal distribution with mean 0, a sample is equally likely to be positive or negative.
In fact, a normal random variable can take any real value. You might be confusing the samples with the probability density function, which is always positive.
Referring to the np.random.normal documentation at https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html, the output is the sample (x values), not the distribution's density (y values). Therefore, the output can be negative.
In other words, np.random.normal draws samples that follow the normal distribution; it does not randomly generate probability-density values from the normal curve.
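A small sketch of the difference between samples (x) and density (y); the pdf here is written out from the standard formula rather than taken from NumPy:

import numpy as np

rng = np.random.default_rng()

# Samples: x values drawn from N(0, 0.3**2); these can be negative.
samples = rng.normal(loc=0.0, scale=0.3, size=10)

# Density: y values of the bell curve; these are always positive.
def normal_pdf(x, mu=0.0, sigma=0.3):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

print(samples)              # mix of positive and negative numbers
print(normal_pdf(samples))  # strictly positive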
Try not to expect a mean of 0 if the values are meant to represent probabilities; that would amount to expecting your random event never to occur.
Try something like np.random.normal(0.5, 0.3, 1000) to express your normal probability distribution.
Also, take a closer look at the math of the normal distribution to be able to construct your probability density functions easily.

How can I apply Chebyshev's inequality to this case?

I have a dataframe which includes heights. The data cannot go below zero. That's why I cannot rely on the standard deviation alone, as this data is not normally distributed. I cannot use the 68-95-99.7 rule here because it fails in my case. Here is my dataframe, mean, and SD.
0.77132064
0.02075195
0.63364823
0.74880388
0.49850701
0.22479665
0.19806286
0.76053071
0.16911084
0.08833981
Mean: 0.41138725956196015
Std: 0.2860541519582141
If I take 2 standard deviations, as you can see, the number becomes negative:
mean - 2 × std = 0.41138725956196015 - 0.2860541519582141 × 2 = -0.160721044354468
I have tried using percentiles and am not satisfied with them, to be honest. How can I apply Chebyshev's inequality to this problem? Here is what I did so far:
np.polynomial.Chebyshev(df['Heights'])
But this returns numbers, not an SD-based level I can measure against. Or do you think Chebyshev is the best choice in my case?
Expected solution:
I am expecting to get a range, e.g. "75% of the time the next height will be between 0.40 and 0.43", etc.
EDIT1: Added histogram
To be clearer, I have added my real data's histogram.
EDIT2: Some values from real data
Mean: 0.007041500928135767
Percentile 50: 0.0052000000000000934
Percentile 90: 0.015500000000000047
Std: 0.0063790857035425025
Var: 4.06873389299246e-05
Thanks a lot
You seem to be confusing two ideas from the same mathematician, Chebyshev. These ideas are not the same.
Chebyshev's inequality states a fact that is true for many probability distributions. For two standard deviations, it states that at least three-fourths of the data items will lie within two standard deviations of the mean. As you state, for normal distributions about 19/20 of the items will lie in that interval, but Chebyshev's inequality is an absolute bound that is met by practically all distributions. The fact that your data values are never negative does not change the truth of the inequality; it just makes the actual proportion of values in the interval even larger, so the inequality holds even more comfortably.
Chebyshev polynomials do not involve statistics, but are simply a series (or two series) of polynomials, commonly used in calculating approximations for computer functions. That is what np.polynomial.Chebyshev involves, and therefore does not seem useful to you at all.
So calculate Chebyshev's inequality yourself. There is no need for a special function for that, since it is so easy (this is Python 3 code):
def Chebyshev_inequality(num_std_deviations):
    return 1 - 1 / num_std_deviations**2
You can change that to handle the case where k <= 1 but the idea is obvious.
In your particular case: the inequality says that at least 3/4, or 75%, of the data items will lie within 2 standard deviations of the mean, which means more than 0.41138725956196015 - 2 * 0.2860541519582141 and less than 0.41138725956196015 + 2 * 0.2860541519582141 (note the different signs), which simplifies to the interval
[-0.16072104435446805, 0.9834955634783884]
In your data, 100% of your data values are in that interval, so Chebyshev's inequality was correct (of course).
Now, if your goal is to predict or estimate where a certain percentile is, Chebyshev's inequality does not help much. It is an absolute lower bound, so it gives only one limit on a percentile. For example, by what we did above we know that the 12.5th percentile is at or above -0.16072104435446805 and the 87.5th percentile is at or below 0.9834955634783884. Those facts are true but are probably not what you want. If you want an estimate that is closer to the actual percentile, this is not the way to go. The 68-95-99.7 rule is an estimate: the actual locations may be higher or lower, but if the distribution is normal then the estimate will not be far off. Chebyshev's inequality does not do that kind of estimate.
If you want to estimate the 12.5th and 87.5th percentiles (showing where 75 percent of the population will fall), you should calculate those percentiles of your sample and use those values. If you don't know more details about the kind of distribution you have, I don't see any better way. There are reasons why normal distributions are so popular!
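Putting the numbers from the question through that reasoning, a small sketch (reusing the Chebyshev_inequality function defined above):

mean, std = 0.41138725956196015, 0.2860541519582141
k = 2

coverage = Chebyshev_inequality(k)           # at least this fraction of values...
interval = (mean - k * std, mean + k * std)  # ...lies in this interval
print(coverage, interval)
# 0.75 (-0.16072104435446805, 0.9834955634783884)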
It sounds like you want the boundaries for the middle 75% of your data.
The middle 75% of the data is between the 12.5th percentile and the 87.5th percentile, so you can use the quantile function to get the values at the locations:
[df['Heights'].quantile(0.5 - 0.75/2), df['Heights'].quantile(0.5 + 0.75/2)]
#[0.09843618875, 0.75906485625]
As per What does it mean when the standard deviation is higher than the mean? What does that tell you about the data? - Quora, SD is a measure of "spread" and mean is a measure of "position". As you can see, these are more or less independent things. Now, if all your samples are positive, SD cannot be greater than the mean because of the way it's calculated, but 2 or 3 SDs very well can.
So, basically, SD being roughly equal to the mean means that your data are all over the place.
Now, a random variable that's strictly positive indeed cannot be exactly normally distributed. But for a rough estimation, seeing that you still have a bell shape, we can pretend it is and still use the SD as a rough measure of the spread (though, since mean minus 2 or 3 SD goes negative here, those multiples lack any physical meaning whatsoever and so are unusable for the sake of our pretense):
E.g. to get a rough prediction of grass growth, you can still take the mean and apply whatever growth model you're using to it -- that will get the new, prospective mean. Then applying the same to mean±SD will give an idea of the new SD.
This is very rough, of course. But to get any better, you'll have to somehow check which distribution you're dealing with and use its peak and spread characteristics instead of mean and SD. And in any case, your prediction will not be any better than your growth model -- studies of which are anything but conclusive judging by e.g. https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1365-3040.2005.01490.x (not a single formula there).

Why does numpy std() give a different result to matlab std()?

I'm trying to convert MATLAB code to NumPy and found that numpy gives a different result for the std function.
in matlab
std([1,3,4,6])
ans = 2.0817
in numpy
np.std([1,3,4,6])
1.8027756377319946
Is this normal? And how should I handle this?
The NumPy function np.std takes an optional parameter ddof: "Delta Degrees of Freedom". By default, this is 0. Set it to 1 to get the MATLAB result:
>>> np.std([1,3,4,6], ddof=1)
2.0816659994661326
To add a little more context, in the calculation of the variance (of which the standard deviation is the square root) we typically divide by the number of values we have.
But if we select a random sample of N elements from a larger distribution and calculate the variance, division by N can lead to an underestimate of the actual variance. To fix this, we can lower the number we divide by (the degrees of freedom) to a number less than N (usually N-1). The ddof parameter allows us change the divisor by the amount we specify.
Unless told otherwise, NumPy will calculate the biased estimator for the variance (ddof=0, dividing by N). This is what you want if you are working with the entire distribution (and not a subset of values which have been randomly picked from a larger distribution). If the ddof parameter is given, NumPy divides by N - ddof instead.
The default behaviour of MATLAB's std is to correct the bias for sample variance by dividing by N-1. This gets rid of some (but probably not all) of the bias in the standard deviation. This is likely to be what you want if you're using the function on a random sample of a larger distribution.
The nice answer by @hbaderts gives further mathematical details.
The standard deviation is the square root of the variance. The variance of a random variable X is defined as

Var(X) = E[(X - E[X])^2]

An estimator for the variance would therefore be

s_N^2 = (1/N) * sum_{i=1}^{N} (x_i - x̄)^2

where x̄ denotes the sample mean. For randomly selected x_i, it can be shown that this estimator does not converge to the real variance σ^2, but to

(N-1)/N * σ^2

If you randomly select samples and estimate the sample mean and variance, you will have to use a corrected (unbiased) estimator

s^2 = (1/(N-1)) * sum_{i=1}^{N} (x_i - x̄)^2

which will converge to σ^2. The correction term N/(N-1) is also called Bessel's correction.
Now by default, MATLAB's std calculates the unbiased estimator with the correction term N-1. NumPy however (as @ajcr explained) calculates the biased estimator with no correction term by default. The parameter ddof allows you to set any correction term N-ddof. By setting it to 1 you get the same result as in MATLAB.
Similarly, MATLAB allows you to pass a second parameter w, which specifies the "weighting scheme". The default, w=0, results in the correction term N-1 (unbiased estimator), while for w=1, only N is used as the correction term (biased estimator).
For people who aren't great with statistics, a simplistic guide is:
Include ddof=1 if you're calculating np.std() for a sample taken from your full dataset.
Ensure ddof=0 if you're calculating np.std() for the full population
The ddof correction is included for samples in order to counterbalance the bias that arises when estimating the spread of a population from only a subset of it.
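A quick sketch of what the divisor change does, using the numbers from the question:

import numpy as np

x = np.array([1, 3, 4, 6])
n = len(x)

biased = np.sqrt(((x - x.mean()) ** 2).sum() / n)          # divides by N
unbiased = np.sqrt(((x - x.mean()) ** 2).sum() / (n - 1))  # divides by N-1

print(biased, np.std(x, ddof=0))    # 1.8027756377319946 (NumPy default)
print(unbiased, np.std(x, ddof=1))  # 2.0816659994661326 (MATLAB default)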

Random Number Generator Explanation

from random import *

def main():
    t = 0
    for i in range(1000):  # thousand
        t += random()
    print(t / 1000)

main()
I was looking at the source code for a sample program my professor gave me and I came across this RNG. Can anyone explain how this RNG works?
If you plotted the points, you would see that this actually produces a Gaussian ("normal") distribution about the mean of the random function.
Generate random numbers following a normal distribution in C/C++ talks about random number generation; it's a pretty common technique to do this if all you have is a uniform number generator like in standard C.
What I've given you here is a histogram of 100,000 values drawn from your function (with the value returned rather than printed, if you aren't familiar with Python). The y axis is the frequency with which a value appears; the x axis is the bin the value falls into. As you can see, the average value is 1/2, and beyond 3 standard deviations from the mean (which covers 99.7 percent of the data) there are almost no values at all. That should be intuitive; we "usually" get something near 1/2, and very rarely get 0.99999.
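The referenced histogram isn't reproduced here, but a sketch along these lines would regenerate it (matplotlib is assumed to be available):

import matplotlib.pyplot as plt
from random import random

def average_of_1000():
    return sum(random() for _ in range(1000)) / 1000

values = [average_of_1000() for _ in range(100_000)]

plt.hist(values, bins=100)  # bell-shaped and centered near 0.5
plt.xlabel("average of 1000 uniform draws")
plt.ylabel("frequency")
plt.show()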
Have a look at the documentation. It's quite well written:
https://docs.python.org/2/library/random.html
The idea is that the program generates a random number 1000 times, which is enough for the mean to come out very close to 0.5.
The program is using the Central Limit Theorem - sums of independent and identically distributed random variables X with finite variance asymptotically converge to a normal (a.k.a. Gaussian) distribution whose mean is the sum of the means, and variance is the sum of the variances. Scaling this by N, the number of X's summed, gives the sample mean (a.k.a. average). If the expected value of X is μ and the variance of X is σ2, the expected value of the sample mean is also μ and it has variance σ2 / N.
Since a Uniform(0,1) has mean 0.5 and variance 1/12, your algorithm will generate results that are pretty close to normally distributed with a mean of 0.5 and a variance of 1/12000. Consequently 99.7% of the outcomes should fall within +/-3 standard deviations of the mean, i.e., in the range 0.5+/-0.0274.
This is a ridiculously inefficient way to generate normals. Better alternatives include the Box-Muller method, Polar method, or ziggurat method.
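An empirical check of those numbers (just a sketch that re-runs the professor's averaging many times with NumPy):

import numpy as np

rng = np.random.default_rng()
averages = rng.random((10_000, 1000)).mean(axis=1)  # 10,000 runs of the averaging program

print(averages.mean())  # ~0.5
print(averages.std())   # ~sqrt(1/12000), i.e. about 0.00913
print(np.mean(np.abs(averages - 0.5) < 3 * np.sqrt(1 / 12000)))  # ~0.997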
The thing making this random is the random() function being called. random() generates one (for most practical purposes) random float between 0 and 1.
>>>random()
0.1759916412898097
>>>random()
0.5489228122596088
etc.
The rest of it is just adding each random number to a total and then dividing by the number of randoms, essentially finding the average of all 1000 randoms, which, as Cyber pointed out, is actually not a random number at all.
