Integer optimization/maximization in numpy - python

I need to estimate the size of a population, by finding the value of n which maximises scipy.misc.comb(n, a)/n**b where a and b are constants. n, a and b are all integers.
Obviously, I could just have a loop in range(SOME_HUGE_NUMBER), calculate the value for each n and break out of the loop once I reach an inflexion in the curve. But I wondered if there was an elegant way of doing this with (say) numpy/scipy, or is there some other elegant way of doing this just in pure Python (e.g. like an integer equivalent of Newton's method?)

As long as your number n is reasonably small (smaller than approx. 1500), my guess for the fastest way to do this is to actually try all possible values. You can do this quickly by using numpy:
import numpy as np
import scipy.misc as misc  # newer SciPy releases provide comb in scipy.special instead
nMax = 1000
a = 77
b = 100
n = np.arange(1, nMax+1, dtype=np.float64)
val = misc.comb(n, a)/n**b
print("Maximized for n={:d}".format(int(n[val.argmax()]+0.5)))
# Maximized for n=181
This is not especially elegant, but it is rather fast for that range of n. The problem is that for n > 1484 the numerator can already get too large to be stored in a float, so this method fails with overflows. And this is not only a problem of numpy.ndarray not working with Python integers. Even with them, you would not be able to compute:
misc.comb(10000, 1000, exact=True)/10000**1001
as the division has to produce a float result from two numbers that are each larger than the maximum a Python float can hold (max_exp = 1024 on my system; see sys.float_info). A plain loop over range would not work in that case either. If you really want to do something like that, you will have to take more care numerically.
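One way to take that extra care is to stay in log space throughout. The following is only a sketch, not something from the answer above: it assumes scipy.special.gammaln is available, and the helper names log_objective and argmax_n are made up for illustration. It maximises log(comb(n, a)) - b*log(n), which has its maximum at the same n as the original ratio.

```python
import numpy as np
from scipy.special import gammaln


def log_objective(n, a, b):
    """log(comb(n, a) / n**b), computed without forming either huge number."""
    n = np.asarray(n, dtype=np.float64)
    return gammaln(n + 1) - gammaln(a + 1) - gammaln(n - a + 1) - b * np.log(n)


def argmax_n(a, b, n_max):
    # Only n >= a is meaningful, since comb(n, a) is 0 below that (a >= 1 assumed).
    n = np.arange(a, n_max + 1)
    return int(n[np.argmax(log_objective(n, a, b))])


print(argmax_n(77, 100, 1000))  # 181, the same n as the direct computation
print(log_objective(10000, 1000, 1001))  # finite, although the plain ratio is out of float range
```

The search range n_max is still a guess you have to supply, so this only removes the overflow problem, not the brute-force nature of the search.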

You essentially have a nicely smooth function of n that you want to maximise. n is required to be integral but we can consider the function instead to be a function of the reals. In this case, the maximising integral value of n must be close to (next to) the maximising real value.
We could convert comb to a real function by using the gamma function and use numerical optimisation techniques to find the maximum. Another approach is to replace the factorials with Stirling's approximation. This gives a moderately complicated but tractable algebraic expression. This expression is not hard to differentiate and set to zero to find the extrema.
I did this and obtained
n * (b + (n-a) * log((n-a)/n) ) = a * b - a/2
This is not straightforward to solve algebraically but easy enough numerically (e.g. using Newton's method, as you suggest).
I may have made a mistake in the algebra, but I typed the a = 77, b = 100 example into Wolfram Alpha and got 180.58 so the approach seems to work.
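For what it's worth, here is a rough sketch of solving that equation numerically and then checking the two neighbouring integers. It uses scipy.optimize.brentq rather than Newton's method, simply because a bracketing solver needs no derivative. The function names and the bracket [a + 1, 1000] are my own assumptions: the left side minus the right side changes sign on that interval for a = 77, b = 100, but other constants may need a different bracket.

```python
import math

from scipy.optimize import brentq
from scipy.special import gammaln


def stationarity(n, a, b):
    # Left side minus right side of the Stirling-based equation.
    return n * (b + (n - a) * math.log((n - a) / n)) - (a * b - a / 2)


def log_objective(n, a, b):
    # log(comb(n, a) / n**b), used to compare the integer candidates.
    return gammaln(n + 1) - gammaln(a + 1) - gammaln(n - a + 1) - b * math.log(n)


def best_integer_n(a, b, lo, hi):
    root = brentq(stationarity, lo, hi, args=(a, b))
    candidates = (math.floor(root), math.ceil(root))
    return root, max(candidates, key=lambda n: log_objective(n, a, b))


root, n_best = best_integer_n(77, 100, 78, 1000)
print(root, n_best)  # root is roughly 180.58, best integer is 181
```

Comparing floor and ceil of the real root relies on the function having a single peak, which seems to hold here but is not proven by this sketch.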

Related

Logarithm over x

Since the following expansion for the logarithm holds:
log(1-x)=-x-x^2/2-x^3/3-...
one can calculate the following functions, which have removable singularities at x = 0:
log(1-x)/x=-1-x/2-...
(log(1-x)/x+1)/x=-1/2-x/3-...
((log(1-x)/x+1)/x+1/2)/x=-1/3-x/4-...
I am trying to use NumPy for these calculations, and specifically the log1p function, which is accurate near x=0. However, convergence for the aforementioned functions is still problematic.
Do you have any ideas for any existing functions implementing these formulas or should I write one myself using the previous expansions, which will not be as efficient, however?
The simplest thing to do is something like
In [17]: def logf(x, eps=1e-6):
    ...:     if abs(x) < eps:
    ...:         return -0.5 - x/3.
    ...:     else:
    ...:         return (1. + log1p(-x)/x)/x
and play a bit with the threshold eps.
If you want a numpy-like, vectorized solution, replace the if with an np.where:
>>> np.where(np.abs(x) > eps, (1. + np.log1p(-x)/x) / x, -0.5 - x/3.)
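One caveat with np.where is that it evaluates both branches for every element, so x = 0 still triggers a division by zero warning in the direct formula. A possible way around that is sketched below. The name logf_vec, the threshold 1e-3 and the number of series terms are my own choices for illustration, and the input is assumed to satisfy x < 1.

```python
import numpy as np


def logf_vec(x, eps=1e-3):
    """(log(1 - x)/x + 1)/x for x < 1, switching to the series near x = 0."""
    x = np.asarray(x, dtype=np.float64)
    small = np.abs(x) < eps
    # Substitute a harmless dummy for small entries so the direct
    # formula never divides by (nearly) zero.
    safe = np.where(small, 0.5, x)
    direct = (1.0 + np.log1p(-safe) / safe) / safe
    # Truncated expansion: -1/2 - x/3 - x^2/4 - x^3/5
    series = -0.5 - x / 3.0 - x**2 / 4.0 - x**3 / 5.0
    return np.where(small, series, direct)


print(logf_vec(np.array([0.0, 1e-9, 0.5])))
```

Keeping a few extra series terms lets the threshold be larger, which keeps the direct formula away from the region where cancellation hurts most.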
Why not successively square the candidate, after initially extracting the exponent component? Whenever the square is greater than 2, divide by two and set the bit in the mantissa of your result that corresponds to the iteration. This is a much quicker and simpler way of determining the base-2 logarithm, which can then be converted to base e or base 10 with a single multiplication.
Some predefined functions don't work at singular points. One simple-minded solution is to compute the series directly, adding its terms one at a time.
For your example, the loop for log(1-x) would be:
total = 0.0
for k in range(1, n + 1):
    total += x**k / k
total = -total
You keep adding terms either up to a fixed count n or until the last term falls below a small threshold.
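A minimal sketch of the "stop when the last term is small" variant could look like this. The function name, the tolerance and the cap on the number of terms are all assumptions of mine, and the series only converges for |x| < 1 (slowly when |x| is close to 1).

```python
import math


def log1m_series(x, tol=1e-16, max_terms=10_000):
    """Approximate log(1 - x) for |x| < 1 by summing -x**k / k."""
    if not -1.0 < x < 1.0:
        raise ValueError("the series only converges for |x| < 1")
    total = 0.0
    power = 1.0
    for k in range(1, max_terms + 1):
        power *= x          # power is now x**k
        term = power / k
        total += term
        if abs(term) < tol:  # remaining terms are smaller still
            break
    return -total


print(log1m_series(0.1), math.log1p(-0.1))  # the two should agree closely
```

For the functions with the leading terms removed, the same loop works if you start k at 2 or 3 and divide the powers of x accordingly.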

Python using scipy.optimize to find the solution to an equation

I want to solve an equation using scipy.optimize
I want to find the solution, n, for the equation
a**n + b**n = c**n
where
a=2.3
b=2.4
c=2.94
I have a list of triplets (a,b,c) I want to experiment with and I know the range of the exponent n will always be 2.0 < n < 4.0. Could I use this fact to speed up the convergence of the solution?
If your function is scalar, and accepts a scalar (your case), and if you know that:
your solution is in a given interval, and the function is continuous in the same interval (your case)
you are interested in one solution, not necessarily in all (if more than 1) solutions in that interval
You can speed up the solution using the bisection algorithm, implemented in scipy as scipy.optimize.bisect, which requires the conditions above to guarantee convergence.
The idea behind the algorithm is quite simple: the bracketing interval is halved at every step, so the number of iterations grows only logarithmically with the precision you ask for.
The algorithm is based on the intermediate value theorem from calculus.
EDIT: I couldn't resist, here you have a MWE
import scipy.optimize as opt
def sol(a,b,c):
    f = lambda n : a**n + b**n - c**n
    return opt.bisect(f,2,4)
print(sol(2.3,2.4,2.94))
>3.1010655957
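Since the question mentions a whole list of triplets, here is a hedged sketch of looping over them. It swaps bisect for scipy.optimize.brentq, which takes the same bracket but usually needs fewer function evaluations. The second triplet is made up purely for illustration, and the code assumes every triplet really has a sign change between n = 2 and n = 4 (brentq raises ValueError otherwise).

```python
from scipy.optimize import brentq


def solve_exponent(a, b, c, lo=2.0, hi=4.0):
    """Find n in [lo, hi] with a**n + b**n == c**n."""
    f = lambda n: a**n + b**n - c**n
    return brentq(f, lo, hi)


# First triplet is from the question; the second one is hypothetical.
triplets = [(2.3, 2.4, 2.94), (2.0, 2.1, 2.6)]
roots = [solve_exponent(a, b, c) for a, b, c in triplets]

for (a, b, c), n in zip(triplets, roots):
    print(a, b, c, n)
```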
As requested in the comments, here's how to do it using mpmath.
We supply the a, b, c parameters as strings rather than as Python floats for maximum accuracy. Converting strings to mpf (mp floats) will be as accurate as the current precision allows. If instead we convert from Python floats then we'd be using numbers that suffer from the imprecision inherent in Python floats.
mp.dps allows us to set the precision in the form of the number of decimal digits.
The mpmath findroot function accepts an initial approximation argument. This can be a single value, or it may be an interval, given as a list or a tuple. It's ok to use Python floats in that interval.
from mpmath import mp
mp.dps = 30
a, b, c = [mp.mpf(u) for u in ('2.3', '2.4', '2.94')]
def f(x):
    return a**x + b**x - c**x
x = mp.findroot(f, [2, 4])
print(x, f(x))
output
3.10106559575904097402104750305 -3.15544362088404722164691426113e-30
By default, findroot uses a simple secant solver. The docs recommend using the 'anderson' or 'ridder' solvers when supplying an interval, but for this equation all 3 solvers give identical results.

Log-computations in Python

I'm looking to compute something like:
sum over i of comb(5000, i) * 0.5**5000 * f(i)
where f(i) is a function that returns a real number in [-1,1] for any i in {1,2,...,5000}.
Obviously, the result of the sum is somewhere in [-1,1], but I can't seem to compute it in Python with straightforward code, as 0.5**5000 becomes 0 and comb(5000,2000) becomes inf, which results in the computed sum turning into NaN.
The required solution is to use log on both sides.
That is, using the identity a × b = 2**(log2(a) + log2(b)), if I could compute log2(a) and log2(b) I could compute the sum, even if a is big and b is almost 0.
So I guess what I'm asking is if there's an easy way of computing
log2(scipy.misc.comb(5000,2000))
So I could compute my sum simply by
sum([2**(log2comb(5000,i)-5000) * f(i) for i in range(1,5000) ])
@abarnert's solution, while working for the 5000 figure, addresses the problem by increasing the precision in which the comb is computed. This works for this example, but doesn't scale, as the memory required would significantly increase if instead of 5000 we had 1e7 for example.
Currently, I'm using a workaround which is ugly, but keeps memory consumption low:
log2(comb(5000,2000)) = sum([log2(x) for x in range(1, 5001)]) - sum([log2(x) for x in range(1, 2001)]) - sum([log2(x) for x in range(1, 3001)])
Is there a way of doing so in a readable expression?
The sum is the expectation of f with respect to a binomial distribution with n = 5000 and p = 0.5.
You can compute this with scipy.stats.binom.expect:
import scipy.stats as stats
def f(i):
    return i
n, p = 5000, 0.5
print(stats.binom.expect(f, (n, p), lb=0, ub=n))
# 2499.99999997
Also note that as n goes to infinity, with p fixed, the binomial distribution approaches the normal distribution with mean np and variance np*(1-p). Therefore, for large n you can instead compute:
import math
print(stats.norm.expect(f, loc=n*p, scale=math.sqrt((n*p*(1-p))), lb=0, ub=n))
# 2500.0
EDIT: @unutbu has answered the real question, but I'll leave this here in case log2comb(n, k) is useful to anyone.
comb(n, k) is n! / ((n-k)! k!), and n! can be computed using the Gamma function gamma(n+1). Scipy provides the function scipy.special.gamma. Scipy also provides gammaln, which is the log (natural log, that is) of the Gamma function.
So log(comb(n, k)) can be computed as gammaln(n+1) - gammaln(n-k+1) - gammaln(k+1)
For example, log(comb(100, 8)) (after executing from scipy.special import gammaln):
In [26]: log(comb(100, 8))
Out[26]: 25.949484949043022
In [27]: gammaln(101) - gammaln(93) - gammaln(9)
Out[27]: 25.949484949042962
and log(comb(5000, 2000)):
In [28]: log(comb(5000, 2000)) # Overflow!
Out[28]: inf
In [29]: gammaln(5001) - gammaln(3001) - gammaln(2001)
Out[29]: 3360.5943053174142
(Of course, to get the base-2 logarithm, just divide by log(2).)
For convenience, you can define:
from math import log
from scipy.special import gammaln
def log2comb(n, k):
    return (gammaln(n+1) - gammaln(n-k+1) - gammaln(k+1)) / log(2)
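To show how log2comb might be plugged back into the original sum, here is a vectorised sketch. The wrapper binomial_half_sum and the sample f are hypothetical, and f is assumed to accept a NumPy array of indices. The sketch includes i = 0 and i = n for simplicity; their weights of 2**(-n) underflow to zero for n = 5000 anyway.

```python
import numpy as np
from scipy.special import gammaln


def log2comb(n, k):
    # Base-2 logarithm of comb(n, k); works elementwise on arrays of k.
    return (gammaln(n + 1) - gammaln(n - k + 1) - gammaln(k + 1)) / np.log(2)


def binomial_half_sum(f, n):
    """Sum of comb(n, i) * 0.5**n * f(i) over i = 0..n, done in log space."""
    i = np.arange(n + 1)
    weights = np.exp2(log2comb(n, i) - n)  # each weight is at most 1
    return float(np.sum(weights * f(i)))


# Hypothetical f with values in [-1, 1], just to exercise the function.
print(binomial_half_sum(np.cos, 5000))
```

A quick sanity check is to pass a constant f equal to 1, which should return a value very close to 1 because the weights are binomial probabilities.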
By default, comb gives you a float64, which overflows and gives you inf.
But if you pass exact=True, it gives you a Python variable-sized int instead, which can't overflow (unless you get so ridiculously huge you run out of memory).
And, while you can't use np.log2 on an int, you can use Python's math.log2.
So:
math.log2(scipy.misc.comb(5000, 2000, exact=True))
As an alternative, you realize that n choose k is defined as n! / (k! (n-k)!), right? You can reduce that to ∏(i=1...k)((n+1-i)/i), which is simple to compute.
Or, if you want to avoid overflow, you can do it by alternating * (n-i) and / (k-i).
Which, of course, you can also reduce to adding and subtracting logs. I think looping in Python and computing 4000 logarithms is going to be slower than looping in C and computing 4000 multiplications, but we can always vectorize it, and then, it might be faster. Let's write it and test:
In [1327]: n, k = 5000, 2000
In [1328]: %timeit math.log2(scipy.misc.comb(5000, 2000, exact=True))
100 loops, best of 3: 1.6 ms per loop
In [1329]: %timeit np.log2(np.arange(n-k+1, n+1)).sum() - np.log2(np.arange(1, k+1)).sum()
10000 loops, best of 3: 91.1 µs per loop
Of course if you're more concerned with memory instead of time… well, this obviously makes it worse. We've got 2000 8-byte floats instead of one 608-byte integer at a time. And if you go up to 100000, 20000, you get 20000 8-byte floats instead of one 9K integer. And at 1000000, 200000, it's 200000 8-byte floats vs. one 720K integer.
I'm not sure why either way is a problem for you. Especially given that you're using a listcomp instead of a genexpr, and therefore creating an unnecessary list of 5000, 100000, or 1000000 Python floats—24MB is not a problem, but 720K is? But if it is, we can obviously just do the same thing iteratively, at the cost of some speed:
r = sum(math.log2(n-i) - math.log2(k-i) for i in range(k))
This isn't too much slower than the scipy solution, and it never uses more than a small constant number of bytes (a handful of Python floats). (Unless you're on Python 2, in which case… just use xrange instead of range and it's back to constant.)
As a side note, why are you using a list comprehension instead of a NumPy array with vectorized operations (for speed, and also a bit of compactness) or a generator expression instead of a list comprehension (for no memory usage at all, at no cost to speed)?

Algorithm to calculate point at which to round values in an array up or down in order to least affect the mean

Consider a random array of values between 0 and 1, such as:
[0.1,0.2,0.8,0.9]
Is there a way to calculate the point at which the values should be rounded down or up to an integer so that the mean of the rounded array matches the mean of the un-rounded array as closely as possible? (In the above case it would be at the mean, but that is purely a coincidence.)
Or is it just trial and error?
I'm coding in Python.
Thanks for any help.
Add them up, then round the sum. That's how many 1s you want. Round so you get that many 1s.
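def rounding_point(l):
    # values >= the returned point are rounded up, everything else down
    # if the input is sorted, you don't need the following line
    l = sorted(l)
    ones_needed = int(round(sum(l)))
    # this may require adjustment if there are duplicates in the input
    # with no ones needed, nothing in [0, 1) may reach the rounding point
    return 1.0 if ones_needed == 0 else l[-ones_needed]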
If sorting the list turns out to be too expensive, you can use a selection algorithm like quickselect. Python doesn't come with a quickselect function built in, though, so don't bother unless your inputs are big enough that the asymptotic advantage of quickselect outweighs the constant factor advantage of the highly-optimized C sorting algorithm.
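To see the idea applied end to end, here is a small self-contained sketch. The function names are mine, and it assumes the convention that values greater than or equal to the threshold become 1. Ties at the threshold can still produce more ones than intended, so duplicates would need extra handling.

```python
def rounding_threshold(values):
    """Smallest value that should round up so the count of ones matches the sum."""
    ordered = sorted(values)
    ones_needed = int(round(sum(ordered)))
    if ones_needed == 0:
        return float("inf")  # nothing should round up
    return ordered[-ones_needed]


def round_with_threshold(values):
    threshold = rounding_threshold(values)
    return [1 if v >= threshold else 0 for v in values]


data = [0.1, 0.2, 0.8, 0.9]
print(rounding_threshold(data))    # 0.8
print(round_with_threshold(data))  # [0, 0, 1, 1]
```

In this example the rounded list has mean 0.5, the same as the original data. Note that Python 3's round uses round-half-to-even, which only matters when the sum lands exactly on a half.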

Calculate poisson probability percentage

When you use the POISSON function in Excel (or in OpenOffice Calc), it takes two arguments:
an integer
an 'average' number
and returns a float.
In Python (I tried RandomArray and NumPy) it returns an array of random poisson numbers.
What I really want is the probability, as a percentage, that this event will occur (it is a constant number, whereas the array contains different numbers every time, so is it an average?).
for example:
print poisson(2.6,6)
returns [1 3 3 0 1 3] (and every time I run it, it's different).
The number I get from calc/excel is 3.19 (POISSON(6,2.6,0)*100).
Am I using the python's poisson wrong (no pun!) or am I missing something?
scipy has what you want
>>> scipy.stats.distributions
<module 'scipy.stats.distributions' from '/home/coventry/lib/python2.5/site-packages/scipy/stats/distributions.pyc'>
>>> scipy.stats.distributions.poisson.pmf(6, 2.6)
array(0.031867055625524499)
It's worth noting that it's pretty easy to calculate by hand, too.
It is easy to do by hand, but you can overflow doing it that way. You can do the exponent and factorial in a loop to avoid the overflow:
import math

def poisson_probability(actual, mean):
    # naive: math.exp(-mean) * mean**actual / math.factorial(actual)
    # iterative, to keep the components from getting too large or small:
    p = math.exp(-mean)
    for i in range(actual):
        p *= mean
        p /= i+1
    return p
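Another way to avoid the overflow, sketched here as an alternative rather than as what the answer above does, is to work with logarithms and exponentiate once at the end. It uses math.lgamma from the standard library for log(k!). The function name is made up, and mean is assumed to be strictly positive.

```python
import math


def poisson_pmf_log(k, mean):
    """Poisson probability of exactly k events, computed via logarithms."""
    if k < 0 or mean <= 0:
        raise ValueError("expects k >= 0 and mean > 0")
    # log(mean**k * exp(-mean) / k!) = k*log(mean) - mean - lgamma(k + 1)
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


print(round(100 * poisson_pmf_log(6, 2.6), 2))  # 3.19, the spreadsheet figure
```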
This page explains why you get an array, and the meaning of the numbers in it, at least.
