Calculate poisson probability percentage

When you use the POISSON function in Excel (or in OpenOffice Calc), it takes two arguments, an integer and an 'average' number, and returns a float.
In Python (I tried RandomArray and NumPy) it returns an array of random poisson numbers.
What I really want is the percentage that this event will occur (it is a constant number and the array has every time different numbers - so is it an average?).
for example:
print poisson(2.6,6)
returns [1 3 3 0 1 3] (and every time I run it, it's different).
The number I get from Calc/Excel is 3.19 (POISSON(6,2.6,0)*100).
Am I using Python's poisson wrong (no pun!) or am I missing something?

scipy has what you want
>>> scipy.stats.distributions
<module 'scipy.stats.distributions' from '/home/coventry/lib/python2.5/site-packages/scipy/stats/distributions.pyc'>
>>> scipy.stats.distributions.poisson.pmf(6, 2.6)
array(0.031867055625524499)
It's worth noting that it's pretty easy to calculate by hand, too.
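As a rough sketch of the by-hand calculation: the probability of exactly k events when the mean is λ is e^(-λ) λ^k / k!. The function name `poisson_pmf` below is only illustrative and does not come from any library.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events when the expected count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

# Same inputs as the spreadsheet call POISSON(6, 2.6, 0)
print(round(poisson_pmf(6, 2.6) * 100, 2))  # 3.19
```

This direct form is fine for small k; for large k the power and the factorial grow quickly, which is where an iterative or log-space variant becomes the safer choice.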

It is easy to do by hand, but you can overflow doing it that way. You can do the exponent and factorial in a loop to avoid the overflow:
import math

def poisson_probability(actual, mean):
    # naive: math.exp(-mean) * mean**actual / factorial(actual)
    # iterative, to keep the components from getting too large or small:
    p = math.exp(-mean)
    for i in range(actual):  # xrange in Python 2
        p *= mean
        p /= i + 1
    return p

This page explains why you get an array, and the meaning of the numbers in it, at least.
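To make the link between the random array and the single probability concrete, here is a small sketch, assuming NumPy's `Generator.poisson` sampler. Each array element is one simulated event count; the fraction of draws equal to 6 approaches the probability the spreadsheet reports. The seed and sample size are arbitrary choices.

```python
import math
import numpy as np

rng = np.random.default_rng(seed=0)            # seed chosen arbitrarily
samples = rng.poisson(lam=2.6, size=200_000)   # random event counts, mean 2.6

empirical = np.mean(samples == 6)              # fraction of draws equal to 6
exact = math.exp(-2.6) * 2.6 ** 6 / math.factorial(6)

print(empirical, exact)  # the two values should be close to each other
```

So the array is not an average of anything; it is a set of simulated outcomes, and the constant number from Excel is the long-run frequency of one particular outcome.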

Related

How to iterate through the Cartesian product of ten lists (ten elements each) faster? (Probability and Dice)

I'm trying to solve this task.
I wrote function for this purpose which uses itertools.product() for Cartesian product of input iterables:
def probability(dice_number, sides, target):
    from itertools import product
    from decimal import Decimal
    FOUR_PLACES = Decimal('0.0001')
    total_number_of_experiment_outcomes = sides ** dice_number
    target_hits = 0
    sides_combinations = product(range(1, sides+1), repeat=dice_number)
    for side_combination in sides_combinations:
        if sum(side_combination) == target:
            target_hits += 1
    p = Decimal(str(target_hits / total_number_of_experiment_outcomes)).quantize(FOUR_PLACES)
    return float(p)
When calling probability(2, 6, 3) output is 0.0556, so works fine.
But calling probability(10, 10, 50) takes a very long time (hours?), and there must be a better way :)
The loop for side_combination in sides_combinations: takes too long because it iterates through all 10**10 combinations.
Please, can you help me find out how to speed up the calculation? I want to sleep tonight.

I guess the problem is to find the distribution of the sum of dice. An efficient way to do that is via discrete convolution. The distribution of the sum of independent variables is the convolution of their probability mass functions (or densities, in the continuous case). Convolution is associative, so you can conveniently compute it just two pmfs at a time (the current distribution of the total so far, and the next one in the list). Then from the final result, you can read off the probabilities for each possible total. The first element in the result is the probability of the smallest possible total, and the last element is the probability of the largest possible total. In between you can figure out which one corresponds to the particular sum you're looking for.
The hard part of this is the convolution, so work on that first. It's just a simple summation, but it's just a little tricky to get the limits of the summation correct. My advice is to work with integers or rationals so you can do exact arithmetic.
After that you just need to construct an appropriate pmf for each input die. The input is just [1, 1, 1, ... 1] if you're using integers (you'll have to normalize eventually) or [1/n, 1/n, 1/n, ..., 1/n] if rationals, where n = number of faces. Also you'll need to label the indices of the output correctly -- again this is just a little tricky to get it right.
Convolution is a very general approach for summations of variables. It can be made even more efficient by implementing convolution via the fast Fourier transform, since FFT(conv(A, B)) = FFT(A) FFT(B). But at this point I don't think you need to worry about that.
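The convolution approach described above can be sketched as follows, using exact rationals as suggested. The names `convolve` and `dice_sum_probability` are made up for this illustration, and the plain double loop is the simple summation, not the FFT variant.

```python
from fractions import Fraction

def convolve(p, q):
    """Discrete convolution of two pmfs given as lists of probabilities."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def dice_sum_probability(dice_number, sides, target):
    """Exact probability that `dice_number` fair dice sum to `target`."""
    die = [Fraction(1, sides)] * sides   # index 0 stands for face 1
    total = [Fraction(1)]                # distribution of an empty sum
    for _ in range(dice_number):
        total = convolve(total, die)
    index = target - dice_number         # smallest possible total is dice_number
    if 0 <= index < len(total):
        return total[index]
    return Fraction(0)

print(round(float(dice_sum_probability(2, 6, 3)), 4))  # 0.0556
```

For ten ten-sided dice this performs only a few thousand multiplications, instead of visiting 10**10 combinations.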

If someone is still interested in a solution that avoids the very, very long iteration through all of the itertools.product Cartesian products, here it is:
def probability(dice_number, sides, target):
    if dice_number == 1:
        return (1 <= target <= sides) / sides
    return sum(probability(dice_number-1, sides, target-x)
               for x in range(1, sides+1)) / sides
But you should add caching of the probability function's results; without it, the calculation will also take a very, very long time.
P.S. This code is 100% not mine. I took it from the internet; I'm not smart enough to produce it by myself. I hope you'll enjoy it as much as I did.
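One way to add the caching mentioned above is `functools.lru_cache` from the standard library. This is a minimal sketch of that idea, not necessarily what the original source of the code used.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # remember every (dice_number, sides, target) result
def probability(dice_number, sides, target):
    if dice_number == 1:
        return (1 <= target <= sides) / sides
    return sum(probability(dice_number - 1, sides, target - x)
               for x in range(1, sides + 1)) / sides

print(round(probability(2, 6, 3), 4))  # 0.0556
print(probability(10, 10, 50))         # returns almost immediately with the cache
```

With the cache, each distinct (dice_number, target) pair is computed once, so the work grows with the number of distinct sub-problems and not with the number of dice combinations.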

How to calculate numbers with large exponents

I was writing a program where I need to calculate insanely huge numbers.
k = int(input())
print(int((2**k)*5 % (10**9 + 7)))
Here, k is of the order of 10^9.
As expected, this was rather slow (taking up to 5 seconds to calculate), whereas my program needs to finish computing in 1 second.
After a little research online I found a function pow(), and by writing
p = 10**9 + 7
print(int(pow(2, k- 1,p)*10))
This works fine for small numbers but messes up at large numbers. I can understand why that is happening (because this isn't essentially what I want to calculate, and the modulus operation with such a large number doesn't affect the calculation for small values of k).
I also found libraries like gmpy2 and numpy but I don't know how to use them since I'm just a beginner with python.
So how can I write an expression for what I want to calculate and which works fast enough and doesn't err at large numbers too?

You can optimize your operation by passing the modulus as the third argument of the builtin pow, multiplying the result by 5, and reducing once more so the product stays below the modulus:
def func(k):
    p = 10**9 + 7
    # pow with three arguments never builds the full 2**k
    return pow(2, k, p) * 5 % p
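A short check of the three-argument `pow` approach against the direct formula, for exponents small enough that the direct formula is still cheap. The helper name `scaled_power` is hypothetical. Note the final `% mod`: multiplying the reduced power by 5 can push the product past the modulus again.

```python
MOD = 10**9 + 7

def scaled_power(k, mod=MOD):
    """Return (2**k * 5) % mod without ever building the huge number 2**k."""
    return pow(2, k, mod) * 5 % mod

# Agrees with the direct formula where that is still cheap to evaluate
assert scaled_power(20) == (2**20 * 5) % MOD
print(scaled_power(10**9))
```

Modular exponentiation needs only on the order of log2(k) multiplications, each on numbers smaller than the modulus, so k around 10^9 is not a problem.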

Integer optimization/maximization in numpy

I need to estimate the size of a population, by finding the value of n which maximises scipy.misc.comb(n, a)/n**b where a and b are constants. n, a and b are all integers.
Obviously, I could just have a loop in range(SOME_HUGE_NUMBER), calculate the value for each n and break out of the loop once I reach an inflexion in the curve. But I wondered if there was an elegant way of doing this with (say) numpy/scipy, or is there some other elegant way of doing this just in pure Python (e.g. like an integer equivalent of Newton's method?)

As long as your number n is reasonably small (smaller than approx. 1500), my guess for the fastest way to do this is to actually try all possible values. You can do this quickly by using numpy:
import numpy as np
from scipy import special  # comb lived in scipy.misc in older SciPy releases
nMax = 1000
a = 77
b = 100
n = np.arange(1, nMax+1, dtype=np.float64)
val = special.comb(n, a)/n**b
print("Maximized for n={:d}".format(int(n[val.argmax()]+0.5)))
# Maximized for n=181
This is not especially elegant, but it is rather fast for that range of n. The problem is that for n>1484 the numerator can already get too large to be stored in a float, so this method fails with overflows. But this is not only a problem of numpy.ndarray not working with Python integers. Even with them, you would not be able to compute:
special.comb(10000, 1000, exact=True)/10000**1001
as you want a float result from the division of two numbers that are each larger than the maximum a float in Python can hold (max_exp = 1024, i.e. max_10_exp = 308, on my system; see sys.float_info). You couldn't use your range in that case either. If you really want to do something like that, you will have to take more care numerically.
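One way to take that extra numerical care is to compare logarithms, so neither the binomial coefficient nor n**b is ever formed. The sketch below uses `math.lgamma` from the standard library; the function names are illustrative only, and this is a swapped-in technique, not the method of the answer above.

```python
import math

def log_objective(n, a, b):
    """log(comb(n, a) / n**b), computed without forming either huge number."""
    log_comb = math.lgamma(n + 1) - math.lgamma(a + 1) - math.lgamma(n - a + 1)
    return log_comb - b * math.log(n)

def argmax_n(a, b, n_max):
    """Integer n in [a, n_max] that maximises comb(n, a) / n**b."""
    return max(range(a, n_max + 1), key=lambda n: log_objective(n, a, b))

print(argmax_n(77, 100, 1000))  # 181, matching the NumPy result above
```

Because the logarithm is monotonic, the maximising n is unchanged, and the search range can extend well past the point where the float version overflows.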

You essentially have a nicely smooth function of n that you want to maximise. n is required to be integral but we can consider the function instead to be a function of the reals. In this case, the maximising integral value of n must be close to (next to) the maximising real value.
We could convert comb to a real function by using the gamma function and use numerical optimisation techniques to find the maximum. Another approach is to replace the factorials with Stirling's approximation. This gives a moderately complicated but tractable algebraic expression. This expression is not hard to differentiate and set to zero to find the extrema.
I did this and obtained
n * (b + (n-a) * log((n-a)/n) ) = a * b - a/2
This is not straightforward to solve algebraically but easy enough numerically (e.g. using Newton's method, as you suggest).
I may have made a mistake in the algebra, but I typed the a = 77, b = 100 example into Wolfram Alpha and got 180.58 so the approach seems to work.
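The equation above can also be solved numerically in a few lines. This sketch swaps in bisection for Newton's method, since bisection needs no derivative; the bracket [100, 1000] is an assumption chosen for the a = 77, b = 100 example, and the function names are hypothetical.

```python
import math

def stationarity_gap(n, a, b):
    """Left side minus right side of the equation quoted above."""
    return n * (b + (n - a) * math.log((n - a) / n)) - (a * b - a / 2)

def solve_real_maximiser(a, b, lo, hi, tol=1e-9):
    """Bisection on [lo, hi]; assumes the gap changes sign exactly once there."""
    f_lo = stationarity_gap(lo, a, b)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = stationarity_gap(mid, a, b)
        if (f_mid < 0) == (f_lo < 0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return (lo + hi) / 2

root = solve_real_maximiser(77, 100, lo=100.0, hi=1000.0)
print(round(root, 2))
```

The result should land close to the 180.58 reported from Wolfram Alpha; the integer candidates to compare are then the two neighbours of that value.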

Algorithm to calculate point at which to round values in an array up or down in order to least affect the mean

Consider a random array of values between 0 and 1 such as:
[0.1,0.2,0.8,0.9]
Is there a way to calculate the point at which the values should be rounded down or up to an integer so that the mean of the rounded array matches the mean of the un-rounded array as closely as possible? (In the above case it would be at the mean, but that is purely a coincidence.)
Or is it just trial and error?
I'm coding in Python.
Thanks for any help.

Add them up, then round the sum. That's how many 1s you want. Round so you get that many 1s.
def rounding_point(l):
    # if the input is sorted, you don't need the following line
    l = sorted(l)
    ones_needed = int(round(sum(l)))
    # this may require adjustment if there are duplicates in the input
    # l[-0] would be l[0], so the "no ones needed" case is handled separately
    return 1.0 if ones_needed == 0 else l[-ones_needed]
If sorting the list turns out to be too expensive, you can use a selection algorithm like quickselect. Python doesn't come with a quickselect function built in, though, so don't bother unless your inputs are big enough that the asymptotic advantage of quickselect outweighs the constant factor advantage of the highly-optimized C sorting algorithm.
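If a full sort is not wanted, the standard library's `heapq.nlargest` is one readily available stand-in. It is not quickselect, but it finds the k-th largest value without sorting the whole list. The function name and the use of `float("inf")` for the "round nothing up" case are assumptions of this sketch.

```python
import heapq

def rounding_point_partial(values):
    """Threshold t such that rounding values >= t up to 1 keeps the mean close."""
    ones_needed = int(round(sum(values)))
    if ones_needed == 0:
        return float("inf")  # nothing should round up
    # k-th largest value without sorting the whole list
    return heapq.nlargest(ones_needed, values)[-1]

data = [0.1, 0.2, 0.8, 0.9]
threshold = rounding_point_partial(data)
rounded = [1 if v >= threshold else 0 for v in data]
print(threshold, rounded)  # 0.8 [0, 0, 1, 1]
```

As with the sorted version, duplicate values at the threshold may round more elements up than intended and would need separate handling.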

Sum of Square Differences (SSD) in numpy/scipy

I'm trying to use Python and Numpy/Scipy to implement an image processing algorithm. The profiler tells me a lot of time is being spent in the following function (called often), which tells me the sum of square differences between two images
def ssd(A,B):
    s = 0
    for i in range(3):
        s += sum(pow(A[:,:,i] - B[:,:,i],2))
    return s
How can I speed this up? Thanks.

Just
s = numpy.sum((A[:,:,0:3]-B[:,:,0:3])**2)
(which I expect is likely just sum((A-B)**2) if the shape is always (,,3))
You can also use the sum method: ((A-B)**2).sum()
Right?

Just to mention that one can also use np.dot:
import numpy as np

def ssd(A,B):
    dif = A.ravel() - B.ravel()
    return np.dot( dif, dif )
This might be a bit faster and possibly more accurate than alternatives using np.sum and **2, but doesn't work if you want to compute ssd along a specified axis. In that case, there might be a magical subscript formula using np.einsum.
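For the per-axis case, one possible `np.einsum` subscript is sketched below for H x W x C images, giving one SSD per channel. The cast to float64 is an added precaution of this sketch: subtracting unsigned integer images directly would wrap around. The function name and the random test images are illustrative only.

```python
import numpy as np

def ssd_per_channel(A, B):
    """Sum of squared differences for each channel of two H x W x C images."""
    # Cast first: subtracting unsigned integer images would wrap around
    diff = A.astype(np.float64) - B.astype(np.float64)
    # Multiply diff by itself elementwise and sum over rows (i) and columns (j)
    return np.einsum('ijk,ijk->k', diff, diff)

rng = np.random.default_rng(0)
A = rng.integers(0, 256, size=(4, 5, 3), dtype=np.uint8)
B = rng.integers(0, 256, size=(4, 5, 3), dtype=np.uint8)
per_channel = ssd_per_channel(A, B)
print(per_channel, per_channel.sum())
```

Summing the per-channel values gives the same total as the flattened `np.dot` version applied to the float-cast arrays.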

I am confused why you are taking i in range(3). Is that supposed to be the whole array, or just part?
Overall, you can replace most of this with operations defined in numpy:
import numpy

def ssd(A,B):
    squares = (A[:,:,:3] - B[:,:,:3]) ** 2
    return numpy.sum(squares)
This way you do one operation instead of three, and numpy.sum may be able to optimize the addition better than the builtin sum.

Further to Ritsaert Hornstra's answer that got 2 negative marks (admittedly I didn't see it in its original form...)
This is actually true.
For a large number of iterations it can often take twice as long to use the '**' operator or the pow(x,y) method as to just manually multiply the pairs together. If necessary use the math.fabs() method if it's throwing out NaN's (which it sometimes does especially when using int16s etc.), and it still only takes approximately half the time of the two functions given.
Not that important to the original question I know, but definitely worth knowing.

I do not know if the pow() function with power 2 will be fast. Try:
def ssd(A,B):
    s = 0
    for i in range(3):
        s += sum((A[:,:,i] - B[:,:,i])*(A[:,:,i] - B[:,:,i]))
    return s

You can try this one:
dist_sq = np.sum((A[:, np.newaxis, :] - B[np.newaxis, :, :]) ** 2, axis=-1)
More details can be found here (the 'k-Nearest Neighbors' example):
https://jakevdp.github.io/PythonDataScienceHandbook/02.08-sorting.html

In Ruby language you can achieve this in this way
def diff_btw_sum_of_squars_and_squar_of_sum(from=1, to=100) # defaults cover 1..100
  ((from..to).inject(:+)**2) - (from..to).map { |num| num ** 2 }.inject(:+)
end
diff_btw_sum_of_squars_and_squar_of_sum # call with the default range
