I'm looking to compute something like:
sum_{i=1}^{5000} comb(5000, i) * 0.5**5000 * f(i)
Where f(i) is a function that returns a real number in [-1,1] for any i in {1,2,...,5000}.
Obviously, the result of the sum is somewhere in [-1, 1], but I can't seem to compute it in Python using straightforward coding, as 0.5**5000 becomes 0 and comb(5000, 2000) becomes inf, which results in the computed sum turning into NaN.
The required solution is to work with logarithms.
That is, using the identity a × b = 2^(log2(a) + log2(b)): if I could compute log2(a) and log2(b), I could compute the sum, even if a is big and b is almost 0.
So I guess what I'm asking is if there's an easy way of computing
log2(scipy.misc.comb(5000,2000))
So I could compute my sum simply by
sum([2**(log2comb(5000,i)-5000) * f(i) for i in range(1,5000) ])
@abarnert's solution, while working for the 5000 figure, addresses the problem by increasing the precision with which comb is computed. This works for this example, but doesn't scale, as the memory required would increase significantly if instead of 5000 we had, say, 1e7.
Currently, I'm using a workaround which is ugly, but keeps memory consumption low:
log2(comb(5000, 2000)) = sum([log2(x) for x in range(1, 5001)]) - sum([log2(x) for x in range(1, 2001)]) - sum([log2(x) for x in range(1, 3001)])
Is there a way of doing so in a readable expression?
The sum
sum_{i=0}^{5000} comb(5000, i) * 0.5**5000 * f(i)
is the expectation of f with respect to a binomial distribution with n = 5000 and p = 0.5.
You can compute this with scipy.stats.binom.expect:
import scipy.stats as stats
def f(i):
    return i
n, p = 5000, 0.5
print(stats.binom.expect(f, (n, p), lb=0, ub=n))
# 2499.99999997
Also note that as n goes to infinity, with p fixed, the binomial distribution approaches the normal distribution with mean np and variance np*(1-p). Therefore, for large n you can instead compute:
import math
print(stats.norm.expect(f, loc=n*p, scale=math.sqrt(n*p*(1-p)), lb=0, ub=n))
# 2500.0
EDIT: @unutbu has answered the real question, but I'll leave this here in case log2comb(n, k) is useful to anyone.
comb(n, k) is n! / ((n-k)! k!), and n! can be computed using the Gamma function gamma(n+1). Scipy provides the function scipy.special.gamma. Scipy also provides gammaln, which is the log (natural log, that is) of the Gamma function.
So log(comb(n, k)) can be computed as gammaln(n+1) - gammaln(n-k+1) - gammaln(k+1)
For example, log(comb(100, 8)) (after executing from scipy.special import gammaln):
In [26]: log(comb(100, 8))
Out[26]: 25.949484949043022
In [27]: gammaln(101) - gammaln(93) - gammaln(9)
Out[27]: 25.949484949042962
and log(comb(5000, 2000)):
In [28]: log(comb(5000, 2000)) # Overflow!
Out[28]: inf
In [29]: gammaln(5001) - gammaln(3001) - gammaln(2001)
Out[29]: 3360.5943053174142
(Of course, to get the base-2 logarithm, just divide by log(2).)
For convenience, you can define:
from math import log
from scipy.special import gammaln
def log2comb(n, k):
    return (gammaln(n+1) - gammaln(n-k+1) - gammaln(k+1)) / log(2)
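With this helper, the original sum can be evaluated in log space along the lines sketched in the question. A minimal sketch, where f is a hypothetical stand-in for the asker's function:

def f(i):
    return (i - 2500) / 2500.0   # hypothetical f mapping into [-1, 1]

n = 5000
total = sum(2 ** (log2comb(n, i) - n) * f(i) for i in range(n + 1))
print(total)   # ≈ 0.0 for this symmetric f

The terms for extreme i underflow harmlessly to 0.0 rather than producing NaN.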
By default, comb gives you a float64, which overflows and gives you inf.
But if you pass exact=True, it gives you a Python variable-sized int instead, which can't overflow (unless you get so ridiculously huge you run out of memory).
And, while you can't use np.log2 on an int, you can use Python's math.log2.
So:
math.log2(scipy.misc.comb(5000, 2000, exact=True))
As an alternative, you know that n choose k is defined as n! / ((n-k)! k!), right? You can reduce that to ∏(i=1...k)((n+1-i)/i), which is simple to compute.
Or, if you want to avoid overflow, you can do it by alternating * (n-i) and / (k-i).
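For instance, a sketch of that interleaved form (the running value never exceeds the final result, although for comb(5000, 2000) the final value itself still overflows a float, which is why logs come next):

def comb_float(n, k):
    # comb(n, k) = product over i in 0..k-1 of (n - i) / (k - i);
    # interleaving the multiply and divide keeps intermediates bounded.
    c = 1.0
    for i in range(k):
        c = c * (n - i) / (k - i)
    return c

print(comb_float(100, 8))   # ≈ 186087894300.0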
Which, of course, you can also reduce to adding and subtracting logs. I think looping in Python and computing 4000 logarithms is going to be slower than looping in C and computing 4000 multiplications, but we can always vectorize it, and then it might be faster. Let's write it and test:
In [1327]: n, k = 5000, 2000
In [1328]: %timeit math.log2(scipy.misc.comb(5000, 2000, exact=True))
100 loops, best of 3: 1.6 ms per loop
In [1329]: %timeit np.log2(np.arange(n-k+1, n+1)).sum() - np.log2(np.arange(1, k+1)).sum()
10000 loops, best of 3: 91.1 µs per loop
Of course if you're more concerned with memory instead of time… well, this obviously makes it worse. We've got 2000 8-byte floats instead of one 608-byte integer at a time. And if you go up to 100000, 20000, you get 20000 8-byte floats instead of one 9K integer. And at 1000000, 200000, it's 200000 8-byte floats vs. one 720K integer.
I'm not sure why either way is a problem for you. Especially given that you're using a listcomp instead of a genexpr, and therefore creating an unnecessary list of 5000, 100000, or 1000000 Python floats—24MB is not a problem, but 720K is? But if it is, we can obviously just do the same thing iteratively, at the cost of some speed:
r = sum(math.log2(n-i) - math.log2(k-i) for i in range(k))
This isn't too much slower than the scipy solution, and it never uses more than a small constant number of bytes (a handful of Python floats). (Unless you're on Python 2, in which case… just use xrange instead of range and it's back to constant.)
As a side note, why are you using a list comprehension instead of a NumPy array with vectorized operations (for speed, and also a bit of compactness), or a generator expression instead of a list comprehension (for no extra memory usage, at no cost in speed)?
In my application I receive from an iterator an arbitrary amount (let's say 1000 for now) of big 1-dimensional arrays arr1, arr2, arr3, ..., arr1000 (10000 entries each). Each entry is an integer between 0 and n, where in this case n = 9. My ultimate goal is to compute a 1-dimensional array result such that result[i] == the mode of arr1[i], arr2[i], arr3[i], ..., arr1000[i].
However, it is not tractable to concatenate the arrays to one big matrix and then compute the mode row-wise, since this may exceed the RAM on my machine.
An alternative would be to set up an array res2 of shape (10000, 10), then loop through every array, use each entry e as an index, and increase the value of res2[i][e] by 1. After looping, I would apply something like argmax. However, this is too slow.
So: is there a way to perform the task quickly, maybe by using NumPy's advanced indexing?
EDIT (due to the comments):
This is basically the code which calculates the modes row-wise, avoiding concatenating the arrays:
def foo(length, n):
    counts = np.zeros((length, n), dtype=np.int_)
    for arr in array_iterator():
        i = 0
        for e in arr:
            counts[i][e] += 1
            i += 1
    return np.argmax(counts, axis=1)
It already takes 60 seconds for 100 arrays of size 10000 (although there is more work done behind the scenes, which contributes to that time; however, this work scales linearly with the number of arrays).
Regarding the real sizes:
The number of arrays is really arbitrary: it's a parameter of the experiments, and I'd like to be able to set it even to values like 10^6. The length of each array depends on the data set I'm working with; it could be 10000, or 100000, or even worse. However, splitting it into smaller pieces may be possible, though annoying.
My free RAM for this task is about 4 GB.
EDIT 2:
The running time I gave above gives a wrong impression: the running time that belongs just to the inner loop (for e in arr) in the scenario above is only 5 seconds, which is now OK for me, since it's negligible compared to the remaining running time. I will leave this question open anyway for a moment, since there might be an even faster method waiting out there.
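For what it's worth, the inner Python loop of the counting scheme above can be replaced by NumPy's advanced indexing. A sketch, assuming array_iterator() yields integer arrays of the given length, as in the question's code:

import numpy as np

def foo_vectorized(length, n):
    counts = np.zeros((length, n), dtype=np.int_)
    rows = np.arange(length)
    for arr in array_iterator():
        # Each row index occurs exactly once per array, so a fancy-indexed
        # += is safe here; np.add.at would be needed if indices repeated.
        counts[rows, arr] += 1
    return np.argmax(counts, axis=1)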
Since the following expansion for the logarithm holds:
log(1-x)=-x-x^2/2-x^3/3-...
one can calculate the following functions, which have removable singularities at x = 0:
log(1-x)/x=-1-x/2-...
(log(1-x)/x+1)/x=-1/2-x/3-...
((log(1-x)/x+1)/x+1/2)/x=-1/3-x/4-...
I am trying to use NumPy for these calculations, and specifically the log1p function, which is accurate near x=0. However, convergence for the aforementioned functions is still problematic.
Do you know of any existing functions that implement these formulas, or should I write one myself using the expansions above, even though that will not be as efficient?
The simplest thing to do is something like
In [17]: def logf(x, eps=1e-6):
...: if abs(x) < eps:
...: return -0.5 - x/3.
...: else:
...: return (1. + log1p(-x)/x)/x
and play a bit with the threshold eps.
If you want a numpy-like, vectorized solution, replace the if with np.where (note that np.where evaluates both branches, so the direct branch will emit a harmless warning at x = 0):
>>> np.where(np.abs(x) > eps, (1. + np.log1p(-x)/x) / x, -0.5 - x/3.)
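A self-contained vectorized variant might silence the spurious warnings explicitly. A sketch, with logf_vec as an illustrative name:

import numpy as np

def logf_vec(x, eps=1e-6):
    x = np.asarray(x, dtype=float)
    # np.where evaluates both branches, so suppress the 0/0 warning
    # produced by the direct formula at x = 0.
    with np.errstate(divide='ignore', invalid='ignore'):
        direct = (1. + np.log1p(-x) / x) / x
    series = -0.5 - x / 3.
    return np.where(np.abs(x) > eps, direct, series)

print(logf_vec([0.0, 1e-9, 0.1]))   # [-0.5, -0.5, -0.536...]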
Why not successively take the square of the candidate, after initially extracting the exponent component? When the square results in a number greater than 2, divide it by two and set the corresponding bit in the mantissa of your result. This is a much quicker and simpler way of determining log base 2, which can then be transformed to base e or base 10 with a single multiplication.
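A minimal sketch of that bit-by-bit scheme (the function name and the 32 fraction bits are illustrative choices, and x is assumed positive):

import math

def log2_bits(x, fraction_bits=32):
    # Pull out the exponent so that m lies in [1, 2); log2(x) = exponent + log2(m).
    exponent = math.frexp(x)[1] - 1
    m = x / 2.0 ** exponent
    result = float(exponent)
    bit = 0.5
    for _ in range(fraction_bits):
        m *= m                # squaring doubles log2(m)
        if m >= 2.0:
            m /= 2.0          # log2(m) gained a leading 1-bit: record it
            result += bit
        bit /= 2.0
    return result

print(log2_bits(5.0))                 # ≈ 2.3219..., matches math.log2(5.0)
print(log2_bits(5.0) * math.log(2))   # one multiplication converts to base e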
Some predefined functions don't work at singularity points. One simple-minded solution is to sum the series directly, term by term.
For your example, the loop for log(1-x) would be:

total = 0.0
for k in range(1, n + 1):
    total += x**k / k
total = -total

You keep adding terms until the last term falls below a small threshold.
I need to estimate the size of a population, by finding the value of n which maximises scipy.misc.comb(n, a)/n**b where a and b are constants. n, a and b are all integers.
Obviously, I could just have a loop over range(SOME_HUGE_NUMBER), calculate the value for each n and break out of the loop once I reach an inflexion in the curve. But I wondered if there was an elegant way of doing this with (say) numpy/scipy, or some other elegant way just in pure Python (e.g. an integer equivalent of Newton's method)?
As long as your number n is reasonably small (smaller than approx. 1500), my guess for the fastest way to do this is to actually try all possible values. You can do this quickly by using numpy:
import numpy as np
import scipy.misc as misc
nMax = 1000
a = 77
b = 100
n = np.arange(1, nMax+1, dtype=np.float64)
val = misc.comb(n, a)/n**b
print("Maximized for n={:d}".format(int(n[val.argmax()]+0.5)))
# Maximized for n=181
This is not especially elegant but rather fast for that range of n. The problem is that for n > 1484 the numerator can already get too large to be stored in a float. This method will then fail, as you will run into overflows. And this is not only a problem of numpy.ndarray not working with Python integers. Even with them, you would not be able to compute:
misc.comb(10000, 1000, exact=True)/10000**1001
as you want a float result when dividing two numbers larger than the maximum a float in Python can hold (max_exp = 1024 on my system; see sys.float_info). You couldn't use your loop over a range in that case either. If you really want to do something like that, you will have to take more care numerically.
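One way to take that care is to do the whole computation in log space, where nothing overflows; since log is monotone, the argmax is unchanged. A sketch using gammaln:

import numpy as np
from scipy.special import gammaln

nMax, a, b = 100000, 77, 100
n = np.arange(a, nMax+1, dtype=np.float64)   # comb(n, a) requires n >= a
# log(comb(n, a)/n**b) = log(comb(n, a)) - b*log(n):
log_val = gammaln(n+1) - gammaln(a+1) - gammaln(n-a+1) - b*np.log(n)
print("Maximized for n={:d}".format(int(n[log_val.argmax()])))
# Maximized for n=181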
You essentially have a nicely smooth function of n that you want to maximise. n is required to be integral but we can consider the function instead to be a function of the reals. In this case, the maximising integral value of n must be close to (next to) the maximising real value.
We could convert comb to a real function by using the gamma function and use numerical optimisation techniques to find the maximum. Another approach is to replace the factorials with Stirling's approximation. This gives a moderately complicated but tractable algebraic expression. This expression is not hard to differentiate and set to zero to find the extrema.
I did this and obtained
n * (b + (n-a) * log((n-a)/n) ) = a * b - a/2
This is not straightforward to solve algebraically but easy enough numerically (e.g. using Newton's method, as you suggest).
I may have made a mistake in the algebra, but I typed the a = 77, b = 100 example into Wolfram Alpha and got 180.58 so the approach seems to work.
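For reference, a sketch of the numerical solve with scipy.optimize.brentq instead of hand-rolled Newton; the bracket [a+1, 10000] is an illustrative choice (the equation also has a spurious root just above n = a, which this bracket skips):

import numpy as np
from scipy.optimize import brentq

a, b = 77, 100

def g(n):
    # left side minus right side of  n*(b + (n-a)*log((n-a)/n)) = a*b - a/2
    return n * (b + (n - a) * np.log((n - a) / n)) - (a * b - a / 2.0)

n_star = brentq(g, a + 1, 10000)
print(n_star)   # ≈ 180.6, consistent with the Wolfram Alpha value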
I need to compute the integral of the following function within ranges that start as low as -150:
import numpy as np
from scipy.special import ndtr
def my_func(x):
    return np.exp(x ** 2) * 2 * ndtr(x * np.sqrt(2))
The problem is that this part of the function
np.exp(x ** 2)
tends toward infinity -- I get inf for values of x less than approximately -26.
And this part of the function
2 * ndtr(x * np.sqrt(2))
which is equivalent to
from scipy.special import erf
1 + erf(x)
tends toward 0.
So, a very, very large number times a very, very small number should give me a reasonably sized number; but instead, Python is giving me nan.
What can I do to circumvent this problem?
I think a combination of @askewchan's solution and scipy.special.log_ndtr will do the trick:
from scipy.special import log_ndtr

_log2 = np.log(2)
_sqrt2 = np.sqrt(2)

def my_func(x):
    return np.exp(x ** 2) * 2 * ndtr(x * np.sqrt(2))

def my_func2(x):
    return np.exp(x * x + _log2 + log_ndtr(x * _sqrt2))
print(my_func(-150))
# nan
print(my_func2(-150))
# 0.0037611803122451198
For x <= -20, log_ndtr(x) uses a Taylor series expansion of the error function to iteratively compute the log CDF directly, which is much more numerically stable than simply taking log(ndtr(x)).
Update
As you mentioned in the comments, the exp can also overflow if x is sufficiently large. Whilst you could work around this using mpmath.exp, a simpler and faster method is to cast up to np.longdouble, which on my machine can represent values up to 1.189731495357231765e+4932:
import mpmath

def my_func3(x):
    return mpmath.exp(x * x + _log2 + log_ndtr(x * _sqrt2))

def my_func4(x):
    return np.exp(np.float128(x * x + _log2 + log_ndtr(x * _sqrt2)))
print(my_func2(50))
# inf
print(my_func3(50))
# mpf('1.0895188633566085e+1086')
print(my_func4(50))
# 1.0895188633566084842e+1086
%timeit my_func3(50)
# The slowest run took 8.01 times longer than the fastest. This could mean
# that an intermediate result is being cached.
# 100000 loops, best of 3: 15.5 µs per loop
%timeit my_func4(50)
# The slowest run took 11.11 times longer than the fastest. This could mean
# that an intermediate result is being cached.
# 100000 loops, best of 3: 2.9 µs per loop
There already is such a function: erfcx. I think erfcx(-x) should give you the integrand you want (note that 1+erf(x)=erfc(-x)).
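A quick check of that suggestion (the integration limits are illustrative, per the question's range):

from scipy.integrate import quad
from scipy.special import erfcx

# erfcx(z) = exp(z**2) * erfc(z), so erfcx(-x) = exp(x**2) * (1 + erf(x)),
# evaluated without forming the overflowing exp(x**2).
x = -150.0
print(erfcx(-x))   # ≈ 0.00376, matching my_func2(-150) above
val, err = quad(lambda t: erfcx(-t), -150, 0)
print(val, err)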
Not sure how helpful this will be, but here are a couple of thoughts that are too long for a comment.
You need to calculate the integral of exp(x^2) * (1 + erf(x)), which, as you correctly identified, is what your function computes. Opening the brackets, you can integrate both parts of the sum separately.
The first part, ∫ exp(x^2) dx = (sqrt(pi)/2) * erfi(x), involves the imaginary error function, which scipy has implemented as scipy.special.erfi.
The second part, ∫ exp(x^2) * erf(x) dx, is harder: its antiderivative is a generalized hypergeometric function. Sadly it looks like scipy does not have an implementation of it, but this package claims it does.
Here I used indefinite integrals without the constants; knowing the limits of integration, it is clear how to obtain definite ones.
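If it helps, a sketch of how one might evaluate that second antiderivative with mpmath; the 2F2 closed form here was derived by matching power series, so treat it as an assumption and verify it (as the quadrature cross-check does) before relying on it:

import math
import mpmath

def antideriv_second_part(x):
    # assumed: integral of exp(x**2)*erf(x) dx = x**2 * 2F2(1,1; 3/2,2; x**2) / sqrt(pi)
    return x**2 * mpmath.hyp2f2(1, 1, 1.5, 2, x**2) / math.sqrt(math.pi)

# Cross-check against direct numerical quadrature; the two prints should agree:
print(antideriv_second_part(0.5) - antideriv_second_part(0.0))
print(mpmath.quad(lambda t: mpmath.exp(t**2) * mpmath.erf(t), [0, 0.5]))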
When you use the POISSON function in Excel (or in OpenOffice Calc), it takes two arguments:
an integer
an 'average' number
and returns a float.
In Python (I tried RandomArray and NumPy) it returns an array of random poisson numbers.
What I really want is the probability that this event will occur (it is a constant number, whereas the array contains different numbers every time, so is it an average?).
for example:
print poisson(2.6,6)
returns [1 3 3 0 1 3] (and every time I run it, it's different).
The number I get from Calc/Excel is 3.19 (POISSON(6, 2.6, 0)*100).
Am I using Python's poisson wrong (no pun intended!) or am I missing something?
scipy has what you want
>>> scipy.stats.distributions
<module 'scipy.stats.distributions' from '/home/coventry/lib/python2.5/site-packages/scipy/stats/distributions.pyc'>
>>> scipy.stats.distributions.poisson.pmf(6, 2.6)
array(0.031867055625524499)
It's worth noting that it's pretty easy to calculate by hand, too.
It is easy to do by hand, but you can overflow doing it that way. You can do the exponent and factorial in a loop to avoid the overflow:
import math

def poisson_probability(actual, mean):
    # naive:   math.exp(-mean) * mean**actual / factorial(actual)
    # iterative, to keep the components from getting too large or small:
    p = math.exp(-mean)
    for i in xrange(actual):
        p *= mean
        p /= i + 1
    return p
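A quick check against the scipy value above (agreement up to floating-point rounding):

print poisson_probability(6, 2.6)   # ≈ 0.0319, matching poisson.pmf(6, 2.6)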
This page explains why you get an array, and the meaning of the numbers in it, at least.