Vectorized computation of log(n!) - python

I have an (arbitrarily shaped) array X of integers, and I would like to compute the logarithm of the factorial of each entry (exactly, not through the Gamma function).
The numbers are big enough that
np.log(scipy.special.factorial(X))
is unfeasible. So I want to do something like np.sum(np.log(np.arange(2,X+1)), axis=-1)
But arange() produces a different-sized range for each entry, so this doesn't work. I thought about padding with ones, but I'm not sure how to do that.
Can this be done in a vectorized way?

I don't see what problem you have with the gamma function. The gamma function isn't an approximation, and while approximations may be involved in the computation of scipy.special.gammaln, there's no reason to expect those approximations to be worse than the error involved in computing the result manually. scipy.special.gammaln seems like the perfect tool for the job:
X_log_factorials = scipy.special.gammaln(X+1)
If you want to do this manually anyway, you could take the logarithms of all positive integers up to the maximum of your array, compute a cumulative sum, and then select the log-factorials you're interested in:
logarithms = numpy.log(numpy.arange(1, X.max()+1))
log_factorials = numpy.cumsum(logarithms)
X_log_factorials = log_factorials[X-1]
(If you want to handle 0!, you will need to make a minor adjustment, such as by setting X_log_factorials[X==0] = 0.)
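For completeness, a small runnable sketch of the cumulative-sum approach with the 0! adjustment folded in, checked against gammaln (the array X here is just an example):
import numpy as np
import scipy.special

X = np.array([[0, 1, 5], [10, 100, 1000]])  # example input

# log-factorials of 1..max(X) via a cumulative sum of logs
log_factorials = np.cumsum(np.log(np.arange(1, X.max() + 1)))

# select per entry; entries equal to 0 get log(0!) = 0
X_log_factorials = np.where(X > 0, log_factorials[X - 1], 0.0)

assert np.allclose(X_log_factorials, scipy.special.gammaln(X + 1))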

Related

Rounding errors: dealing with operations on vectors with very small components

Imagine having some vectors (they could be torch tensors or numpy arrays) with a huge number of components, each one very small (~1e-10).
Say we want to calculate the norm of one of these vectors (or the dot product between two of them). Even using a float64 data type, the precision on each component will be ~1e-10, while the product of two components (during the norm/dot-product computation) can easily reach ~1e-20, causing a lot of rounding errors that, summed together, return a wrong result.
Is there a way to deal with this situation? (For example, is there a way to define arbitrary-precision arrays for these operations, or some built-in operation that takes care of this automatically?)
You are dealing with two different issues here:
Underflow / Overflow
Calculating the norm of very small values may underflow to zero when you calculate the square. Large values may overflow to infinity. This can be solved by using a stable norm algorithm.
A simple way to deal with this is to scale the values temporarily. See for example this:
a = np.array((1e-30, 2e-30), dtype='f4')
np.linalg.norm(a) # result is 0 due to underflow in single precision
scale = 1. / np.max(np.abs(a))
np.linalg.norm(a * scale) / scale # result is 2.236e-30
This is now a two-pass algorithm because you have to iterate over all your data before determining a scaling value. If this is not to your liking, there are single-pass algorithms, though you probably don't want to implement them in Python. The classic would be Blue's algorithm:
http://degiorgi.math.hr/~singer/aaa_sem/Float_Norm/p15-blue.pdf
A simpler but much less efficient way is to simply chain calls to hypot (which uses a stable algorithm). You should never do this, but just for completeness:
norm = 0.
for value in a:
    norm = math.hypot(norm, value)
Or even a hierarchical version like this to reduce the number of numpy calls:
norm = a
while len(norm) > 1:
    hlen = len(norm) >> 1
    front, back = norm[:hlen], norm[hlen: 2 * hlen]
    tail = norm[2 * hlen:]  # only present when the length is not even
    norm = np.append(np.hypot(front, back), tail)
norm = norm[0]
You are free to combine these strategies. For example if you don't have your data available all at once but blockwise (e.g. because the data set is too large and you read it from disk), you can pick a scaling value per block, then chain the blocks together with a few calls to hypot.
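A rough sketch of that blockwise strategy (here blocks is a hypothetical iterable of 1-D chunks, e.g. read from disk):
import numpy as np

def blockwise_norm(blocks):
    # Combine per-block scaled norms with hypot, as described above.
    total = 0.0
    for block in blocks:
        scale = np.max(np.abs(block))
        if scale == 0.0:
            continue  # an all-zero block contributes nothing
        partial = np.linalg.norm(block / scale) * scale  # scaled, then un-scaled
        total = np.hypot(total, partial)
    return total

# e.g. blockwise_norm(np.array_split(a, 100)) for an in-memory array a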
Rounding errors
You accumulate rounding errors, especially when accumulating values of different magnitude. If you accumulate values of different signs, you may also experience catastrophic cancelation. To avoid these issues, you need to use a compensated summation scheme. Python provides a very good one with math.fsum.
So if you absolutely need highest accuracy, go with something like this:
math.sqrt(math.fsum(np.square(a * scale))) / scale
Note that this is overkill for a simple norm since there are no sign changes in the accumulation (so no cancelation) and the squaring increases all differences in magnitude so that the result will always be dominated by its largest components, unless you are dealing with a truly horrifying dataset. That numpy does not provide built-in solutions for these issues tells you that the naive algorithm is actually good enough for most real-world applications. No reason to go overboard with the implementation before you actually run into trouble.
Application to dot products
I've focused on the l2 norm because that is the case that is more generally understood to be hazardous. Of course you can apply similar strategies to a dot product.
np.dot(a, b)
ascale = 1. / np.max(np.abs(a))
bscale = 1. / np.max(np.abs(b))
np.dot(a * ascale, b * bscale) / (ascale * bscale)
This is particularly useful if you use mixed precision. For example, the dot product could be calculated in single precision, while the final division by (ascale * bscale) could take place in double or even extended precision.
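A minimal sketch of that mixed-precision idea (the float32 accumulation is just an assumption for illustration):
import numpy as np

a = np.array([1e-30, 2e-30, 3e-30])
b = np.array([3e-30, 2e-30, 1e-30])

ascale = 1. / np.max(np.abs(a))
bscale = 1. / np.max(np.abs(b))

# accumulate the dot product in single precision on the scaled values ...
dot32 = np.dot((a * ascale).astype('f4'), (b * bscale).astype('f4'))

# ... and undo the scaling in double precision
dot = np.float64(dot32) / (ascale * bscale)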
And of course math.fsum is still available: dot = math.fsum(a * b)
Bonus thoughts
The scaling itself introduces some rounding errors, because no one guarantees you that the scaled values are exactly representable in floating point. However, you can avoid this by picking a scaling factor that is an exact power of 2. Multiplying by a power of 2 is always exact in FP (assuming you stay in the representable range). You can get the exponent with math.frexp.
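For example (a minimal sketch; the array a is just example data):
import math
import numpy as np

a = np.array([3e-30, -7e-30, 1e-31])

# frexp returns (mantissa, exponent) with |mantissa| in [0.5, 1),
# so 2.0**-exponent is an exact power-of-two scaling factor.
_, exponent = math.frexp(np.max(np.abs(a)))
scale = 2.0 ** -exponent

norm = np.linalg.norm(a * scale) / scale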

scipy integrate over array with variable bounds

I am trying to integrate a function over a list of points, passing the whole array to an integration function in order to vectorize the thing. For starters, calling scipy.integrate.quad in a loop is way too slow since I have something like 10,000,000 points to integrate. scipy.integrate.romberg does the trick much faster, almost instantaneously, whereas quad is slow since you must loop over it or vectorize it.
My function is quite complicated, but for demonstration purposes, let's say I want to integrate x^2 from a to b, where the integrand also takes an extra argument y that may be an array of scalars. For example
import numpy as np
from scipy.integrate import quad, romberg
def integrand(x, y):
    return x**2 + y**2

quad(integrand, 0, 10, args=(10,))  # this fails since y is not a scalar
romberg(integrand, 0, 10)           # y works here, giving the integral over
                                    # the entire range
But this only works for fixed bounds. Is there a way to do something like
z = np.arange(20,30)
romberg(integrand, 0, z) # Fails since the function doesn't seem to
# support variable bounds
The only way I see is to re-implement the algorithm itself in numpy so that I can have variable bounds. Is there any function that supports something like this? There is also romb, where you must supply the values of the integrand directly along with a dx interval, but that will be too imprecise for my complicated function (the Marcum Q function; I couldn't find any implementation, which could be another way to do it).
The best approach when trying to evaluate a special function is to write a function that uses the properties of the function to quickly and accurately evaluate it in all parameter regimes. It is quite unlikely that a single approach will give accurate (or even stable) results for all ranges of parameters. Direct evaluation of an integral, as in this case, will almost certainly break down in many cases.
That being said, the general problem of evaluating an integral over many ranges can be solved by turning the integral into a differential equation and solving that. Roughly, the steps would be
Given an integral I(t), which I will assume is an integral of a function f(x) from 0 to t [this can be generalized to an arbitrary lower limit], write it as the differential equation dI/dt = f(t).
Solve this differential equation using scipy.integrate.odeint() for some initial conditions (here I(0)) over some range of times from 0 to t. This range should contain all limits of interest. How finely this is sampled depends on the function and how accurately it needs to be evaluated.
The result will be the value of the integral from 0 to t for the set of t we input. We can turn this into a "continuous" function using interpolation. For example, using a spline we can define i = scipy.interpolate.InterpolatedUnivariateSpline(t,I).
Given a set of upper and lower limits in arrays b and a, respectively, then we can evaluate them all at once as res=i(b)-i(a).
Whether this approach will work in your case will require you to carefully study it over your range of parameters. Also note that the Marcum Q function involves a semi-infinite integral. In principle this is not a problem, just transform the integral to one over a finite range. For example, consider the transformation x->1/x. There is no guarantee this approach will be numerically stable for your problem.
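To make the steps above concrete, here is a minimal sketch using x^2 as a stand-in integrand (swap in your own f; the grid range and density are assumptions you would need to tune for accuracy):
import numpy as np
from scipy.integrate import odeint
from scipy.interpolate import InterpolatedUnivariateSpline

def f(x):
    return x**2  # stand-in integrand

# Solve dI/dt = f(t), I(0) = 0, on a grid covering all limits of interest.
t = np.linspace(0.0, 30.0, 3001)
I = odeint(lambda I_val, t_val: f(t_val), 0.0, t).ravel()

# Turn the sampled antiderivative into a "continuous" function.
i = InterpolatedUnivariateSpline(t, I)

# Evaluate the integral for whole arrays of limits at once.
a = np.zeros(10)
b = np.arange(20.0, 30.0)
res = i(b) - i(a)  # should be close to b**3 / 3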

Derivative of an array in python?

Currently I have two numpy arrays: x and y of the same size.
I would like to write a function (possibly calling numpy/scipy... functions if they exist):
def derivative(x, y, n=1):
    # something
    return result
where result is a numpy array of the same size as x, containing the value of the n-th derivative of y with respect to x (I would like the derivative to be evaluated using several values of y in order to avoid non-smooth results).
This is not a simple problem, but there are a lot of methods that have been devised to handle it. One simple solution is to use finite difference methods. numpy.diff() performs finite differencing, and you can specify the order of the difference (to approximate a derivative, divide by the grid spacing raised to that order).
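If all you need is a first (or repeated) derivative of y with respect to a possibly non-uniform x, numpy.gradient is often more convenient than numpy.diff, because it returns an array of the same size as the input; a minimal sketch:
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 200)   # may also be non-uniformly spaced
y = np.sin(x)

dy_dx = np.gradient(y, x)        # ~ cos(x), central differences in the interior
d2y_dx2 = np.gradient(dy_dx, x)  # repeated application for higher derivatives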
Wikipedia also has a page that lists the finite difference coefficients needed for derivatives of different orders and accuracies, in case the numpy functions don't do what you want.
Depending on your application, you can also use scipy.fftpack.diff, which uses a completely different technique to do the same thing (though your function needs a well-defined Fourier transform).
There are lots and lots and lots of variants (e.g. summation by parts, finite differencing operators, or operators designed to preserve known evolution constants in your system of equations) on both of the two ideas above. What you should do will depend a great deal on what the problem is that you are trying to solve.
The good thing is that a lot of work has been done in this field. The Wikipedia page on numerical differentiation has some resources (though it is focused on finite differencing techniques).
The findiff project is a Python package that can do derivatives of arrays of any dimension with any desired accuracy order (of course depending on your hardware restrictions). It can handle arrays on uniform as well as non-uniform grids and also create generalizations of derivatives, i.e. general linear combinations of partial derivatives with constant and variable coefficients.
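As a sketch of the sort of call findiff supports on a uniform grid (check the project's documentation for the exact, current API):
import numpy as np
from findiff import FinDiff  # assumes the findiff package is installed

x = np.linspace(0.0, 10.0, 1000)
dx = x[1] - x[0]
y = np.sin(x)

d_dx = FinDiff(0, dx, 1, acc=4)  # 1st derivative along axis 0, 4th-order accuracy
dy_dx = d_dx(y)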
Would something like this solve your problem?
import numpy as np

def get_inflection_points(arr, n=1):
    """
    Returns inflection points from an array.
    arr: array
    n: n-th discrete difference
    """
    inflections = []
    dx = 0
    for i, x in enumerate(np.diff(arr, n)):
        if x >= dx and i > 0:
            inflections.append(i*n)
        dx = x
    return inflections

autocorrelation function of time series data with numpy

I have been trying to calculate an autocorrelation function, as defined in statistical mechanics, with numpy. Most of the documentation I found relates to functions like correlate and convolve. However, for a given random variable x, these functions just seem to calculate the sum
ACF(dt) = sum_{t=0}^{T} x(t) * x(t+dt)
instead of the average
ACF(dt) = mean[x(t)*x(t+dt)]
so in fact for calculating an autocorrelation function one would need to do something like:
acf = np.correlate(x,x,mode='full')
acf_half = acf[acf.size // 2:]
ldata = len(acf)
acf = np.array([x/(ldata-i) for i,x in enumerate(acf_half)])
Of course we would need to subtract mean(x)**2 from the resulting acf to be correct.
Can anyone confirm that this is correct?
Generally speaking, the autocorrelation, correlation, etc. is the sum (integral). Sometimes it is normalized, but not averaged in the sense you've written above. This is because they are defined in terms of the mathematical convolution operation, which is simply the integral that you've written as a sum above.
The brackets at the stat mech page indicate a thermal average, which is an ensemble or time average over the 'experiment' taking place many times at many different states at some temperature. This (the finite temperature) causes the fluctuations that give rise to the 'statistical' nature of the problem, and cause the decay of the correlation (loss of long range order). This simply means that you should find the autocorrelation of several datasets, and average those together, but do not take the mean of the function.
As far as I can tell, your code is attempting to weight the correlation at lag dt by the length of the overlap, but I do not believe that this is correct.
With respect to the subtraction of <s>^2: that applies to the spin model, where <s> would be the mean spin (magnetization), so I believe you are correct that you should subtract mean(x)**2.
As a side-note, I would suggest using mode='same' instead of 'full' so that the domain of your correlation matches the domain of your input without having to look at just one-half of the output (here the output is symmetric, so it doesn't really make a difference).
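For reference, here is a direct (slow but unambiguous) sketch of the averaged, mean-subtracted definition quoted in the question; it can serve as a ground truth when checking any np.correlate-based shortcut:
import numpy as np

def acf_direct(x, max_lag):
    # ACF(dt) = mean_t[ x(t) * x(t+dt) ] - mean(x)**2, averaged over the overlap
    x = np.asarray(x, dtype=float)
    acf = np.array([np.mean(x[:len(x) - dt] * x[dt:]) for dt in range(max_lag + 1)])
    return acf - x.mean()**2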

Generalized least square on large dataset

I'd like to linearly fit data that were NOT sampled independently. I came across the generalized least squares method:
b=(X'*V^(-1)*X)^(-1)*X'*V^(-1)*Y
The equation is in Matlab format; X and Y are the coordinates of the data points, and V is a "variance matrix".
The problem is that, due to its size (1000 rows and columns), the V matrix becomes singular and thus non-invertible. Any suggestions for how to get around this problem? Maybe a way of solving the generalized linear regression problem other than GLS? The tools that I have available and am (slightly) familiar with are Numpy/Scipy, R, and Matlab.
Instead of:
b=(X'*V^(-1)*X)^(-1)*X'*V^(-1)*Y
Use
b = (X'/V*X)\(X'/V*Y)
That is, replace all instances of X*(Y^-1) with X/Y. Matlab will skip calculating the inverse (which is hard, and error prone) and compute the divide directly.
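In Numpy/Scipy (which the question lists as available), the same advice translates to solving linear systems with numpy.linalg.solve instead of forming V^(-1). A small synthetic sketch (the AR(1)-style covariance below is only an illustration):
import numpy as np

rng = np.random.default_rng(0)

# synthetic data with correlated noise (names follow the question)
n = 200
t = np.arange(n)
X = np.column_stack([np.ones(n), t.astype(float)])  # design matrix
V = 0.9 ** np.abs(np.subtract.outer(t, t))          # AR(1)-style covariance
Y = X @ np.array([2.0, -1.0]) + rng.multivariate_normal(np.zeros(n), V)

# b = (X' V^-1 X)^-1 X' V^-1 Y, without ever forming V^-1 explicitly
VinvX = np.linalg.solve(V, X)
VinvY = np.linalg.solve(V, Y)
b = np.linalg.solve(X.T @ VinvX, X.T @ VinvY)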
Edit: Even with the best matrix manipulation, some operations are not possible, and you can still end up with errors like the one you describe.
An example that may be relevant to your problem is trying to solve a least squares problem under the constraint that multiple measurements are perfectly, 100% correlated. Except in rare, degenerate cases this cannot be accomplished, either mathematically or physically. You need some independence in the measurements to account for measurement noise or modeling errors. For example, if you have two measurements, each with a variance of 1, and perfectly correlated, then your V matrix would look like this:
V = [1 1; ...
1 1];
And you would never be able to fit to the data. (This generally means you need to reformulate your basis functions, but that's a longer essay.)
However, if you adjust your measurement variance to allow for some small amount of independence between the measurements, then it would work without a problem. For example, 95% correlated measurements would look like this
V = [1 0.95; ...
0.95 1 ];
You can use singular value decomposition as your solver. It'll do the best that can be done.
I usually think about least squares another way. You can read my thoughts here:
http://www.scribd.com/doc/21983425/Least-Squares-Fit
See if that works better for you.
I don't understand how the size is an issue. If you have N (x, y) pairs, you still only have to solve for (M+1) coefficients in an M-th order polynomial:
y = a0 + a1*x + a2*x^2 + ... + am*x^m
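To illustrate that point with an ordinary, unweighted polynomial fit (note this ignores the correlation structure that GLS is meant to handle):
import numpy as np

rng = np.random.default_rng(0)

x = np.linspace(0.0, 1.0, 100000)  # many points ...
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.01 * rng.normal(size=x.size)

coeffs = np.polyfit(x, y, deg=2)   # ... but only 3 coefficients to solve for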
