Accuracy of deriving the CDF using integration - python

I have two ways of deriving the probability that a normally (say) distributed random variable lies within an interval. The first and most straightforward is the following:
import scipy.stats
print(scipy.stats.norm.cdf(6) - scipy.stats.norm.cdf(5))
# 2.85664984223e-07
And the second is by integrating the pdf:
import scipy.integrate
print(scipy.integrate.quad(scipy.stats.norm.pdf, 5, 6)[0])
# 2.85664984234e-07
The difference in this case is really tiny, but it doesn't mean it can't grow larger for other distributions or integration limits. Can you tell which is more accurate and why?
By the way, the first alternative seems to be at least 10 times faster, so if it is also more accurate (which would be my guess, since it is somewhat specialized), then it is perfect.

In this particular case, given those particular numbers, the quad approach will actually be more accurate. The CDF itself can be computed quickly and accurately, of course, but look at the actual numbers:
>>> scipy.stats.norm.cdf(6), scipy.stats.norm.cdf(5)
(0.9999999990134123, 0.99999971334842808)
When you're differencing two very similar quantities, you lose accuracy. Similar problems can be mitigated somewhat during integration if the coders are careful with their summations.
Anyway, we can check this against a high-resolution calculation using mpmath:
>>> via_cdf = scipy.stats.norm.cdf(6)-scipy.stats.norm.cdf(5)
>>> via_quad = scipy.integrate.quad(scipy.stats.norm.pdf, 5, 6)[0]
>>> import mpmath
>>> mpmath.mp.dps = 100
>>> def cdf(x): return 0.5 * (1 + mpmath.erf(x/mpmath.sqrt(2)))
>>> highres = cdf(6)-cdf(5)
>>> highres
mpf('0.0000002856649842341562135330514687422473118357532223619105443630157837185833042478210791954518847897468442097')
>>> float((highres - via_quad)/highres)
-2.3824773334590333e-16
>>> float((highres - via_cdf)/highres)
3.86659439572868e-11
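One way to sidestep the cancellation is to work with upper-tail probabilities, which are small near 5 and 6, instead of CDF values, which are close to 1. The sketch below uses only the standard library and assumes the standard normal distribution; `upper_tail` and `cdf` are illustrative helper names, and the reference value is the mpmath result quoted above, rounded to a double. In SciPy, `scipy.stats.norm.sf` computes the same upper tail.

```python
import math

def upper_tail(x):
    """Standard normal survival function Q(x) = 1 - CDF(x), computed via erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def cdf(x):
    """Standard normal CDF via erf; values near 1 lose the digits of the tail."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# High-precision mpmath value quoted above, rounded to double precision.
reference = 2.856649842341562e-07

via_cdf = cdf(6.0) - cdf(5.0)                  # difference of two numbers near 1
via_tail = upper_tail(5.0) - upper_tail(6.0)   # difference of two small numbers

rel_err_cdf = abs(via_cdf - reference) / reference
rel_err_tail = abs(via_tail - reference) / reference
print("relative error via CDF difference: ", rel_err_cdf)
print("relative error via tail difference:", rel_err_tail)
```

The tail difference subtracts numbers of very different magnitude, so almost no digits cancel, whereas the CDF difference throws away roughly the first seven significant digits of each operand.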

The first calls an implementation of the CDF included in scipy.special. The latter actually does the integration. The former is probably more accurate (as it is limited only by the computer's ability to evaluate the CDF and not by any errors introduced by numerical integration). In practice, unless you need results that are good to better than 6 decimal places, you're probably fine.

Related

Word vector similarity precision

I am trying to implement Gensim's most_similar function by hand but calculate the similarity between the query word and just one other word (avoiding the time to calculate it for the query word with all other words). So far I use
cossim = (np.dot(a, b)
/ np.linalg.norm(a)
/ np.linalg.norm(b))
and this is the same as the similarity result between a and b. I find this works almost exactly but that some precision is lost, for example
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
model_gigaword = api.load("glove-wiki-gigaword-300")
a = 'france'
b = 'chirac'
cossim1 = model_gigaword.most_similar(a)
import numpy as np
cossim2 = (np.dot(model_gigaword[a], model_gigaword[b])
/ np.linalg.norm(model_gigaword[a])
/ np.linalg.norm(model_gigaword[b]))
print(cossim1)
print(cossim2)
Output:
[('french', 0.7344760894775391), ('paris', 0.6580672264099121), ('belgium', 0.620672345161438), ('spain', 0.573593258857727), ('italy', 0.5643460154533386), ('germany', 0.5567398071289062), ('prohertrib', 0.5564222931861877), ('britain', 0.5553334355354309), ('chirac', 0.5362644195556641), ('switzerland', 0.5320892333984375)]
0.53626436
So the most_similar function gives 0.53626441955... (rounds to 0.53626442) and the calculation with numpy gives 0.53626436. Similarly, you can see differences between the values for 'paris' and 'italy' (in similarity compared to 'france'). These differences suggest that the calculation is not being done to full precision (but it is in Gensim). How can I fix it and get the output for a single similarity to higher precision, exactly as it comes from most_similar?
TL/DR - I want to use function('france', 'chirac') and get 0.5362644195556641, not 0.53626436.
Any idea what's going on?
UPDATE: I should clarify, I want to know and replicate how most_similar does the computation, but for only one (a,b) pair. That's my priority, rather than finding out how to improve the precision of my cossim calculation above. I just assumed the two were equivalent.
To increase accuracy you can try the following:
a = np.array(model_gigaword[a]).astype('float128')
b = np.array(model_gigaword[b]).astype('float128')
cossim = (np.dot(a, b)
/ np.linalg.norm(a)
/ np.linalg.norm(b))
The vectors are likely to use lower-precision floats, and hence there is a loss of precision in the calculations.
However, the results I got are somewhat different to what model_gigaword.most_similar offers for you:
model_gigaword.similarity: 0.5362644
float64: 0.5362644263010196
float128: 0.53626442630101950744
You may want to check what you get on your machine and with your version of Python and gensim.
Because floating-point numbers (like the np.float32-typed values in these vector models) are represented using an imprecise binary approximation, none of the numbers you're working with, or displaying, are the exact decimal numbers you think they are.
The number you're seeing as 0.53626436 isn't exactly that - but some binary floating-point number very close to that number. Similarly, the number you're seeing as 0.5362644195556641 isn't exactly that - but some other binary floating-point number, very close to that.
Further, these tiny imprecisions can mean that mathematical expressions that should under ideal circumstances give identical results to each other, no matter the order-of-evaluation, instead give slightly different results for different orders-of-evaluation. For example, we know that mathematically, a * (b + c) is always equal to ab + ac. However, if a, b, & c are floating-point numbers with limited precision, the results of doing the addition then multiplication, versus doing two multiplications then one addition, might vary - because the interim values would have been approximated slightly differently.
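As a rough illustration of that point, the sketch below checks a * (b + c) against a*b + a*c for many random triples of ordinary Python floats (double precision, not the float32 values of the vector model). The seed, trial count and tolerance are arbitrary choices made for this example.

```python
import math
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
trials = 10_000
mismatches = 0           # triples where the two orderings differ bit-for-bit
all_close = True         # do the two orderings always agree within tolerance?

for _ in range(trials):
    a, b, c = rng.random(), rng.random(), rng.random()
    left = a * (b + c)        # add first, then multiply
    right = a * b + a * c     # multiply twice, then add
    if left != right:
        mismatches += 1
    if not math.isclose(left, right, rel_tol=1e-12):
        all_close = False

print(mismatches, "of", trials, "triples differ in the last bits")
print("all pairs agree to a relative tolerance of 1e-12:", all_close)
```

Comparing with `math.isclose` (or `numpy.isclose` for arrays) rather than `==` is the usual way to make code robust to this kind of noise.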
But: for nearly all domains in which these numbers are used, this tiny amount of noise shouldn't make any difference. The right policy is to ignore it, and write code that's robust to this small 'jitter' in extremely-low-significance digits - especially when printing or comparing results.
So really you should only be printing/comparing these numbers to a level of significance where they reliably agree, say, 4 digits after the decimal:
0.53626436
0.5362644195556641
(In fact, your output already makes it look like you may have changed the default level of display-precision in numpy or python, because it wouldn't be typical for the results of most_similar() to display with those 16 digits after the decimal.)
If you really, really wanted, as an exploration, to match the most_similar() results exactly, you could look at its source code. Then, perform the exact same steps, in the exact same order, using the exact same library routines, on your inputs.
(Here's the source for most_similar() in the current gensim-4.0.0beta prerelease: https://github.com/RaRe-Technologies/gensim/blob/4.0.0beta/gensim/models/keyedvectors.py#L690)
But: insisting on such exact correspondence is usually unwise, & creates more-fragile code, given the inherent imprecision in floating-point math.
See also: another answer covering some similar issues, which also points out a way to change the default displayed precision.

Integer optimization/maximization in numpy

I need to estimate the size of a population, by finding the value of n which maximises scipy.misc.comb(n, a)/n**b where a and b are constants. n, a and b are all integers.
Obviously, I could just have a loop in range(SOME_HUGE_NUMBER), calculate the value for each n and break out of the loop once I reach an inflexion in the curve. But I wondered if there was an elegant way of doing this with (say) numpy/scipy, or is there some other elegant way of doing this just in pure Python (e.g. like an integer equivalent of Newton's method?)
As long as your number n is reasonably small (smaller than approx. 1500), my guess for the fastest way to do this is to actually try all possible values. You can do this quickly by using numpy:
import numpy as np
import scipy.misc as misc
nMax = 1000
a = 77
b = 100
n = np.arange(1, nMax+1, dtype=np.float64)
val = misc.comb(n, a)/n**b
print("Maximized for n={:d}".format(int(n[val.argmax()]+0.5)))
# Maximized for n=181
This is not especially elegant but rather fast for that range of n. Problem is that for n>1484 the numerator can already get too large to be stored in a float. This method will then fail, as you will run into overflows. But this is not only a problem of numpy.ndarray not working with python integers. Even with them, you would not be able to compute:
misc.comb(10000, 1000, exact=True)/10000**1001
as you want a float result from dividing two numbers that are both larger than the largest float Python can hold (max_exp = 1024 on my system, i.e. a maximum of about 1.8e308; see sys.float_info). You couldn't use your range in that case either. If you really want to do something like that, you will have to take more care numerically.
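One way of taking that care, sketched here with the standard library only, is to maximise the logarithm of the objective instead of the objective itself. The binomial coefficient is written with the log-gamma function, so nothing overflows even for large n. The function names and the upper search limit are choices made for this example.

```python
import math

def log_objective(n, a, b):
    """Logarithm of comb(n, a) / n**b, with comb expressed via log-gamma."""
    log_comb = math.lgamma(n + 1) - math.lgamma(a + 1) - math.lgamma(n - a + 1)
    return log_comb - b * math.log(n)

def best_n(a, b, n_max):
    """Integer n in [a, n_max] that maximises the objective."""
    return max(range(a, n_max + 1), key=lambda n: log_objective(n, a, b))

# Same a and b as in the example above, but searched far beyond n = 1484.
print(best_n(77, 100, 20000))  # 181
```

Because the logarithm is monotonic, the maximising n is unchanged, and the search range is no longer limited by what a float can hold.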
You essentially have a nicely smooth function of n that you want to maximise. n is required to be integral but we can consider the function instead to be a function of the reals. In this case, the maximising integral value of n must be close to (next to) the maximising real value.
We could convert comb to a real function by using the gamma function and use numerical optimisation techniques to find the maximum. Another approach is to replace the factorials with Stirling's approximation. This gives a moderately complicated but tractable algebraic expression. This expression is not hard to differentiate and set to zero to find the extrema.
I did this and obtained
n * (b + (n-a) * log((n-a)/n) ) = a * b - a/2
This is not straightforward to solve algebraically but easy enough numerically (e.g. using Newton's method, as you suggest).
I may have made a mistake in the algebra, but I typed the a = 77, b = 100 example into Wolfram Alpha and got 180.58 so the approach seems to work.
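A minimal sketch of the numerical solve, using Newton's method on the condition above. The derivative was worked out by hand for this example from the stated equation, and the starting guess of 200 is an assumption; the iteration only makes sense while n stays larger than a.

```python
import math

def g(n, a, b):
    """Left side minus right side of the Stirling-based condition above."""
    return n * (b + (n - a) * math.log((n - a) / n)) - (a * b - a / 2)

def g_prime(n, a, b):
    """Derivative of g with respect to n (hand-derived for this sketch)."""
    return a + b + (2 * n - a) * math.log((n - a) / n)

def newton(a, b, n0, tol=1e-12, max_iter=50):
    """Newton iteration from the starting guess n0 (which must exceed a)."""
    n = n0
    for _ in range(max_iter):
        step = g(n, a, b) / g_prime(n, a, b)
        n -= step
        if abs(step) < tol * n:
            break
    return n

root = newton(77, 100, 200.0)
print(root)  # real-valued maximiser; the integer answer is one of its neighbours
```

For a = 77, b = 100 the root falls between 180 and 181, in line with the value from Wolfram Alpha, so only those two integers need to be compared with the exact objective.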

Sum with Pythons uncertainties giving a different result than expected

A friend of mine is evaluating data with Python's uncertainties package. I am her statistics consultant, and I have come across a weird result in her code.
sum(array) and sqrt(sum(unumpy.std_devs(array)**2)) yield different results, with the second one being the variance method as usually used in engineering.
Now, I know that the variance approach is only suited to cases where the error is small compared to the partial derivative (because of the Taylor series), which isn't the case here, but how does uncertainties handle this? And how can I reproduce in any way what uncertainties does!?
You forgot to square the standard error to make it the variance. This should work and be equal to the error of sum(array):
sqrt(sum(unumpy.std_devs(array)**2))
Then
import uncertainties as uc
from uncertainties import unumpy
import random
import math
a = [uc.ufloat(random.random(), random.random()) for _ in range(100)]
sa = unumpy.std_devs(sum(a))
sb = math.sqrt(sum(unumpy.std_devs(a)**2))
print(sa)
print(sb)
print(sa == sb)
Will result with something like
5.793714811166615
5.793714811166615
True
This happens because the elements of my array are AffineScalarFunc objects (as opposed to Variable objects), so they store not only the value but also all the variables that the value depends on [1].
Now, my values are not fully independent (which wasn't clear at all at first sight*), and thus sum(array) also considers the off-diagonal elements of my covariance matrix in accordance with this formula (sorry that the article is in German, but English Wikipedia's formula isn't as intuitive), whereas sqrt(sum(unumpy.std_devs(array)**2)) obviously doesn't and just adds up the diagonal elements.
A way to reproduce what uncertainties does is:
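from math import sqrt
from uncertainties import covariance_matrix
cov = covariance_matrix(array)  # compute the matrix once, not inside the loop
total = 0  # avoid shadowing the built-in sum()
for i in range(len(array)):
    for j in range(len(array)):
        total += cov[i][j]
print(sqrt(total))
And then unumpy.std_devs(sum(array))==sqrt(total) is True.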
*Correlation due to the use of data taken from the same interpolation (of measurements) and because the length of a measurement was calculated as the difference of two times (and the measurements were consecutive, so the times are now correlated!)
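The difference between the two formulas can be seen without the uncertainties package at all. The sketch below uses a made-up 2x2 covariance matrix (two quantities with standard deviation 1 and correlation 0.5) and compares the full sum over all matrix entries with the diagonal-only sum; the function names are illustrative.

```python
import math

def std_of_sum(cov):
    """Standard deviation of x_1 + ... + x_n from the full covariance matrix."""
    return math.sqrt(sum(sum(row) for row in cov))

def std_of_sum_diagonal_only(cov):
    """Same quantity if the off-diagonal covariances are ignored."""
    return math.sqrt(sum(cov[i][i] for i in range(len(cov))))

# Hypothetical example: unit variances, covariance 0.5 between the two values.
cov = [[1.0, 0.5],
       [0.5, 1.0]]

full = std_of_sum(cov)                          # sqrt(1 + 1 + 2 * 0.5) = sqrt(3)
diagonal_only = std_of_sum_diagonal_only(cov)   # sqrt(1 + 1) = sqrt(2)
print(full, diagonal_only)
```

With positive correlations the full formula gives the larger uncertainty; the two only agree when every off-diagonal entry is zero.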

Python using scipy.optimize to find the solution to an equation

I want to solve an equation using scipy.optimize.
I want to find the solution, n, for the equation
a**n + b**n = c**n
where
a=2.3
b=2.4
c=2.94
I have a list of triplets (a,b,c) I want to experiment with and I know the range of the exponent n will always be 2.0 < n < 4.0. Could I use this fact to speed up the convergence of the solution?
If your function is scalar, and accepts a scalar (your case), and if you know that:
your solution is in a given interval, and the function is continuous in the same interval (your case)
you are interested in one solution, not necessarily in all (if more than 1) solutions in that interval
You can speed up the solution using the bisection algorithm, implemented here in scipy, which requires the conditions above to guarantee convergence.
The idea behind the algorithm is quite simple: each step halves the interval, so the number of iterations grows only logarithmically with the required precision.
See this fundamental calculus theorem on which the algorithm is based.
EDIT: I couldn't resist, here you have a MWE
import scipy.optimize as opt
def sol(a, b, c):
    # a**n + b**n - c**n changes sign exactly once on [2, 4] for these inputs
    f = lambda n: a**n + b**n - c**n
    return opt.bisect(f, 2, 4)
print(sol(2.3, 2.4, 2.94))
>3.1010655957
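For reference, the interval-halving idea itself fits in a few lines of plain Python. This is a sketch of the textbook algorithm, not of SciPy's implementation; the tolerance and the returned step count are choices made for this example.

```python
def bisect(f, lo, hi, tol=1e-12):
    """Halve [lo, hi] until it is narrower than tol; f must change sign on it."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    steps = 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid                  # sign change is in the left half
        else:
            lo, f_lo = mid, f_mid     # sign change is in the right half
        steps += 1
    return 0.5 * (lo + hi), steps

a, b, c = 2.3, 2.4, 2.94
root, steps = bisect(lambda n: a**n + b**n - c**n, 2.0, 4.0)
print(root, steps)
```

Starting from an interval of width 2, roughly 41 halvings are enough to reach a width of 1e-12, which is where the logarithmic iteration count comes from.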
As requested in the comments, here's how to do it using mpmath.
We supply the a, b, c parameters as strings rather than as Python floats for maximum accuracy. Converting strings to mpf (mp floats) will be as accurate as the current precision allows. If instead we convert from Python floats then we'd be using numbers that suffer from the imprecision inherent in Python floats.
mp.dps allows us to set the precision in the form of the number of decimal digits.
The mpmath findroot function accepts an initial approximation argument. This can be a single value, or it may be an interval, given as a list or a tuple. It's ok to use Python floats in that interval.
from mpmath import mp
mp.dps = 30
a, b, c = [mp.mpf(u) for u in ('2.3', '2.4', '2.94')]
def f(x):
    return a**x + b**x - c**x
x = mp.findroot(f, [2, 4])
print(x, f(x))
output
3.10106559575904097402104750305 -3.15544362088404722164691426113e-30
By default, findroot uses a simple secant solver. The docs recommend using the 'anderson' or 'ridder' solvers when supplying an interval, but for this equation all 3 solvers give identical results.
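The point about passing the parameters as strings can be illustrated with the standard-library decimal module, used here only as a stand-in for mpmath: constructing from a string gives the decimal number as written, while constructing from a Python float gives the exact value of the nearest binary double.

```python
from decimal import Decimal

from_string = Decimal("2.3")   # exactly the decimal number 2.3
from_float = Decimal(2.3)      # exact value of the double nearest to 2.3

print(from_string)
print(from_float)
print(from_float - from_string)   # small but nonzero conversion error
```

The same distinction applies to mp.mpf('2.3') versus mp.mpf(2.3): at 30 digits of working precision, the error inherited from the float would already show up around the 16th or 17th significant digit.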

python scipy.sparse.linalg.eigs giving different results for consecutive calls

I am trying to compute the spectral radius of a sparse matrix in python. This is what I have:
from scipy.sparse.linalg import eigs
from scipy import sparse
w = sparse.rand(10, 10, 0.1)
spec_radius = max(abs(eigs(w)[0]))
where the values of w get scaled to be in the range of [-1,1]. However, running that command gives a different result every time:
>>> print(max(abs(eigs(w)[0])))
4.51859016293e-05
>>> print(max(abs(eigs(w)[0])))
4.02309443625e-06
>>> print(max(abs(eigs(w)[0])))
3.7611221426e-05
What gives? I would have thought it would be the same every time. Am I misunderstanding how these commands work?
Sorry for responding to an old question here, but the other answer is not quite satisfactory.
The randomness is not part of the algorithm bundled with ARPACK, but rather the initialization of the algorithm. From the scipy documentation the initialization, v0, is random, unless specified by the user. Sure enough, we see this (note the setup is slightly different--entries of w are scaled to be in [0,1]):
import numpy
from scipy.sparse.linalg import eigs
from scipy import sparse
w = sparse.rand(10, 10, 0.1)
w = w/w.max()
If we do not specify v0, we get some (slight) randomness:
>>> print(max(abs(eigs(w)[0])))
0.00024188777676476916
>>> print(max(abs(eigs(w)[0])))
0.00028073646868200566
>>> print(max(abs(eigs(w)[0])))
0.00025250058038424729
>>> print(max(abs(eigs(w)[0])))
0.00018183677959035711
But if we specify the initialization, we always get the same answer:
>>> print(numpy.all([max(abs(eigs(w, v0=numpy.ones(10))[0])) == 0.00026363015600771211 for k in range(1000)]))
True
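The effect of the starting vector can be reproduced with a toy iterative solver. The sketch below is plain power iteration on a small symmetric matrix, which is much simpler than ARPACK's implicitly restarted Arnoldi method and is only meant to show that an unconverged iterative estimate depends on where it starts. The matrix, iteration counts and names are made up for this example.

```python
import math
import random

def power_iteration(matrix, v0, iterations):
    """Estimate the largest eigenvalue magnitude, starting from vector v0."""
    norm = math.sqrt(sum(x * x for x in v0))
    v = [x / norm for x in v0]
    estimate = 0.0
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        estimate = math.sqrt(sum(x * x for x in w))   # ||M v|| with ||v|| = 1
        v = [x / estimate for x in w]
    return estimate

M = [[2.0, 1.0],
     [1.0, 3.0]]   # eigenvalues (5 +/- sqrt(5)) / 2

rng = random.Random()   # deliberately unseeded, like a default random start
random_start = [power_iteration(M, [rng.random(), rng.random()], 3) for _ in range(3)]
fixed_start = [power_iteration(M, [1.0, 1.0], 3) for _ in range(3)]
converged = power_iteration(M, [1.0, 1.0], 100)

print(random_start)   # typically three slightly different estimates
print(fixed_start)    # three identical estimates
print(converged)      # close to (5 + sqrt(5)) / 2
```

With a fixed start every run performs the same arithmetic, so the results are identical; with a random start, the results only coincide once the iteration has converged well beyond the digits being compared.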
You apply the methods correctly and they will give you the same results if the absolute value of the largest eigenvalue is significantly larger than 0. The reason for the outcome you observe is based on the iterative nature of the algorithm that is used to determine the eigenvalues. From the documentation:
"This function is a wrapper to the ARPACK [R209] SNEUPD, DNEUPD, CNEUPD, ZNEUPD, functions which use the Implicitly Restarted Arnoldi Method to find the eigenvalues and eigenvectors [R210]." In case you are interested in details of this algorithm, you can find an explanation e.g. here.
As in all numerical methods you can determine the desired value only with a certain precision. For eigenvalues that are significantly unequal to 0, you will always obtain the same output; for values that are close to 0 you might obtain different values as you observed in the above example.
You can try to vary the parameters "maxiter" and "tol" (check the above cited documentation for details) which you can pass to "eigs". Maxiter is the maximum number of Arnoldi update iterations allowed - by increasing the number you should get more accurate results. "Tol" is the relative accuracy for eigenvalues and stopping criterion of the algorithm.
