I'm trying to implement the haversine formula to determine whether a given location's latitude and longitude fall within a specified radius. I used the formula detailed here:
Calculate distance between two latitude-longitude points? (Haversine formula)
I am getting a Math Domain Error with the following reproducible input. It does not happen all the time, but often enough to make me think I've written incorrect code:
from math import atan2, sqrt, sin, cos
# All long / lat values are in radians and of type float
centerLongitude = -0.0391412861306467
centerLatitude = 0.9334153362515779
inputLatitudeValue = -0.6096173085842176
inputLongitudeValue = 2.4190393564390438
longitudeDelta = inputLongitudeValue - centerLongitude # 2.4581806425696904
latitudeDelta = inputLatitudeValue - centerLatitude # -1.5430326448357956
a = (sin(latitudeDelta / 2) ** 2 + cos(centerLatitude) * cos(centerLongitude)
* sin(longitudeDelta / 2) ** 2)
# a = 1.0139858858386017
c = 2 * atan2(sqrt(a), sqrt(1 - a)) # Error occurs on this line
# Check whether distance is within our specified radius below
You cannot use sqrt on negative numbers:
>>> sqrt(-1)
ValueError: math domain error
use cmath.sqrt instead:
>>> import cmath
>>> cmath.sqrt(-1)
1j
in your case:
>>> a = 1.0139858858386017
>>> sqrt(1-a)
ValueError: math domain error
Speaking generally, and said simply: the variable a must be protected. It must never be greater than 1.0 and never less than 0.0; normally, with clean data, it will fall properly within this range.
The problem is with how the common floating-point implementations approximate and round their results, and how those results can end up slightly out of range or out of domain for a built-in math function. It is not particularly common for a combination of operations to produce a number that is out of domain for a following math function, but when it does occur it is, depending on the algorithm, consistently reproducible, and it needs to be accounted for and mitigated beyond normal theoretical intuition about the algorithm. What we code is subject to how theoretical concepts have been implemented in software and hardware. Even on paper, with pencil, we need guidelines for when and how to round floating-point results. Computer implementations are no different, except that they must also decide how and when to approximate and round during the conversion to and from the binary representations actually calculated at the machine level, and we are sometimes blissfully unaware of such things going on under the hood.
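A tiny illustration of that under-the-hood rounding, unrelated to haversine itself:

# Classic floating-point artifact: the decimal values 0.1 and 0.2 have no
# exact binary representation, so their sum is not exactly 0.3.
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False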
Regarding the haversine formula, which a am I speaking about? As seen in this version of the code (for reference):
import math
a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
In the above example, the a is not properly protected. It is a lurking problem waiting to crash the c calculation on the next line, under certain inputs presented to the haversine formula.
If some latlon combination of data results in
a = 1.00000000000000000000000000000000000001
the following c calculation will cause an error.
If some latlon combination of data results in
a = -0.00000000000000000000000000000000000000000000001
the following c calculation will cause an error.
It is the floating-point implementation of your language/platform, with its methods of rounding and approximation, that can cause the rare but real and consistently repeatable, ever-so-slightly out-of-domain condition that causes the error in an unprotected haversine implementation.
Years ago I ran a three-day brute-force test of relative angular distances between 0 and 1, and between 179 and 180, with VERY small step values. The radius of the sphere was one, a unit sphere, so radius values were irrelevant. Of course, finding approximate distances on the surface of the sphere in any unit other than angular distance would require including the radius of the sphere in its units, but I was testing the haversine logic itself, and a radius of 1 eliminated a complication. Relative angular distances of 0-1 and 179-180 are the conditions where haversine can have difficulties, when implemented with popular floating-point arithmetic that converts to and from binary at a low system level, if the a is not protected. Haversine is theoretically well conditioned for small angular distances, but a machine or software implementation of FPA (floating-point arithmetic) does not always cooperate precisely with the ideals of spherical geometry. After three days, the test had logged thousands of latlon combinations that crashed the unprotected, popularly posted haversine formula, because the a was not protected. You must protect the a. If it goes above 1.0 or below 0.0 by even the slightest bit, all you need to do is test for that condition and nudge it back into range. Simple.
I protected an a of -0.000000000000000000002, or in other words any a < 0.0, by reassigning 0.0 to it, and also setting a flag I could inspect later so I would know a protective action had been necessary for that data.
I protected an a of 1.000000000000000000002, or in other words any a > 1.0, by reassigning 1.0 to it, and again setting a flag I could inspect later.
It is a simple two extra lines before the c calculation; you could even squeeze it all onto one line. The protective line(s) of code go after the a calculation and before the c calculation, as in the sketch below. Protect your a.
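A minimal sketch of that protection, assuming dlat, dlon, lat1 and lat2 are already defined in radians, as in the reference snippet above:

import math

a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2

# Protect a: clamp any floating-point overshoot back into [0.0, 1.0],
# and record that a nudge happened so the input data can be inspected later.
a_was_nudged = False
if a < 0.0:
    a, a_was_nudged = 0.0, True
elif a > 1.0:
    a, a_was_nudged = 1.0, True

c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))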
Are you then losing precision with those slight nudges? No more than what the floating-point math with that data is already introducing with its approximations and roundings. The formula crashed with data that should not have crashed a pure theoretical algorithm, one without rare FPA issues. Simply protect the a, and this should mitigate those errors with haversine in this form. There are alternatives to haversine, but haversine is completely suitable for many uses, if one understands where it is well suited. I use it for skysphere calculations, where the ellipsoid shape of the earth has nothing to do with anything. Haversine is simple and fast. But remember to protect your a.
Related
I'm looking for a way to disable this:
print(0+1e-20) returns 1e-20, but print(1+1e-20) returns 1.0
I want it to return something like 1+1e-20.
I need it because of this problem:
from numpy import sqrt

def f1(x):
    return 1 / ((x + 1) * sqrt(x))

def f2(x):
    return 1 / ((x + 2) * sqrt(x + 1))

def f3(x):
    return f2(x - 1)

print(f1(1e-6))
print(f3(1e-6))
print(f1(1e-20))
print(f3(1e-20))
returns
999.9990000010001
999.998999986622
10000000000.0
main.py:10: RuntimeWarning: divide by zero encountered in double_scalars
return 1/((x+2)*sqrt(x+1))
inf
f1 is the original function, f2 is f1 shifted by 1 to the left, and f3 is f2 moved back by 1 to the right. By this logic, f1 and f3 should be equal to each other, but this is not the case.
I know about decimal.Decimal, and it doesn't work, because decimal doesn't support some functions, including sin. If you could somehow make Decimal work for all functions, I'd like to know how.
Can't be done. There is no rounding to disable - that would imply the exact result ever existed, which it did not. Floating-point numbers have limited precision. The easiest way to envision this is an analogy with art and computer screens. Imagine someone making a fabulously detailed painting, and all you have is a 1024x768 screen to view it through. If a microscopic dot is added to the painting, the image on the screen might not change at all. Maybe you need a 4K screen instead.
In Python, the closest representable number after 1.0 is 1.0000000000000002 (*), according to math.nextafter(1.0, math.inf) (Python 3.9+ required for math.nextafter). 1e-20 and 1 are too different in magnitude, so the result of their addition cannot be represented by a Python floating-point number, which is precise to only about 16 significant digits.
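A quick demonstration of both points (math.nextafter needs Python 3.9+):

import math

print(math.nextafter(1.0, math.inf))  # 1.0000000000000002, the next float above 1.0
print(1 + 1e-20 == 1.0)               # True: the 1e-20 is absorbed completely
print(0 + 1e-20)                      # 1e-20 is perfectly representable on its own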
See Is floating point math broken? for an in-depth explanation of the cause.
As this answer suggests, there are libraries like mpmath that implement arbitrary-precision arithmetic:
from mpmath import mp, mpf
mp.dps = 25 # set precision to 25 decimal digits
mp.sin(1)
# => mpf('0.8414709848078965066525023183')
mp.sin(1 + mpf('1e-20')) # mpf is a constructor for mpmath floats
# => mpf('0.8414709848078965066579053457')
mpmath floats are sticky; if you add an int and an mpf you get an mpf, so I did not have to write mp.mpf(1). The result is still not exact, but you can select whatever precision is sufficient for your needs. Also, note that the difference between these two results is, again, too small to be representable by Python's floating-point numbers, so if the difference is meaningful to you, you have to keep it in mpmath land:
float(mpf('0.8414709848078965066525023183')) == float(mpf('0.8414709848078965066579053457'))
# => True
(*) This is actually a white lie. The next number after 1.0 is 0x1.0000000000001p+0, that is, 1 + 2**-52 (a single one in the 52nd binary fraction bit), but Python doesn't like hexadecimal or binary float literals. 1.0000000000000002 is Python's decimal approximation of that number for your convenience.
As others have stated, in general this can't be done (due to how computers commonly represent numbers).
It's common to work with the precision you've got; ensuring that algorithms are numerically stable can be awkward.
In this case I'd redefine f1 to work on the logarithms of numbers, e.g.:
from numpy import exp, log, log1p

def f1(x):
    # 1 / ((x + 1) * sqrt(x)) computed via logarithms; log1p stays
    # accurate for tiny x, where x + 1 would otherwise lose x entirely.
    prod = log1p(x) + log(x) / 2
    return exp(-prod)
You might need to alter other parts of the code to work in log space as well depending on what you need to do. Note that most stats algorithms work with log-probabilities because it's much more compatible with how computers represent numbers.
f3 is a bit more work due to the subtraction.
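As a quick check, the log-space f1 above reproduces the question's well-behaved outputs without the divide-by-zero warning:

print(f1(1e-6))   # ~999.999..., matches the direct formula
print(f1(1e-20))  # ~1e10, no RuntimeWarning and no inf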
I am trying to find roots of a function in python using fsolve:
import math
import scipy.optimize

def f(a):
    eq = (-2*a**2 - 2*a**2*(math.sin(25*a**(1/4)))**2
          - 2*a**2*(math.cos(25*a**(1/4)))**2
          - 2*math.exp(-25*a**(1/4))*a**2*math.cos(25*a**(1/4))
          - 2*math.exp(25*a**(1/4))*a**2*math.cos(25*a**(1/4)))
    return eq

print(f(scipy.optimize.fsolve(f, 10)))
and it returns the following value:
[1234839.75468454]
That doesn't seem very close to 0 to me... Does it simply lack the computational power to calculate more decimals for the root? If so, what would be a good alternative for fsolve that could also calculate roots, just more accurately?
To better understand what happens, a first step would be to look at the information returned by the run, which you can get by setting the full_output argument to True (see https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fsolve.html for more details).
When starting from an initial point of 10, as you do, the algorithm claims to converge, and it does so in fewer evaluations than the maximum allocated, so it is not a problem of computing power.
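A minimal sketch of that, assuming the f defined in the question is in scope:

from scipy.optimize import fsolve

root, infodict, ier, mesg = fsolve(f, 10, full_output=True)
print(root)              # the claimed root
print(ier, mesg)         # ier == 1 means fsolve considers itself converged
print(infodict["nfev"])  # number of function evaluations actually used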
I ran into an OOP problem when coding something in Python that I don't know how to address in an elegant way. I have a class that represents the equation of a line (y = mx + b) based on the m and b parameters, called Line. Vertical lines have infinite slope and have equation x = c, so there is another class VerticalLine which only requires a c parameter. Note that I am unable to have a Line class that is represented by two points in the xy-plane; if this were a solution I would indeed use it.
I want to be able to rotate the lines. Rotating a horizontal line by pi/2 + k*pi (k an integer) results in a vertical line, and vice versa. So a normal Line would have to somehow be converted to a VerticalLine in-place, which is impossible in python (well, not impossible but incredibly wonky). How can I better structure my program to account for this problem?
Note that other geometric objects in the program have a rotation method that is in-place, and they are already used frequently, so if I could I would like the line rotation methods to also be in place. Indeed, this would be a trivial problem if the line rotation methods could return a new rotated Line or VerticalLine object as seen fit.
Geometric objects that have a fixed boundary/end-points can be translated and rotated in place. But for a line, unless you mean a segment from point A to point B with a fixed length, both end-points are at infinity or -infinity (y = mx + c). Arithmetic involving infinity is not simple, and I believe that is what complicates the in-place rotation and translation algorithms here.
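One way to sidestep the two-class split, sketched here as an assumption rather than a drop-in fix for the original design, is a single Line class in normal form x*cos(theta) + y*sin(theta) = d, which rotates in place without ever needing to become a different type:

import math

class Line:
    # A line in normal form: x*cos(theta) + y*sin(theta) = d.
    # Vertical and horizontal lines are just particular theta values,
    # so no separate VerticalLine class is needed.
    def __init__(self, theta, d):
        self.theta = theta  # direction of the unit normal, in radians
        self.d = d          # signed distance of the line from the origin

    def rotate(self, angle):
        # Rotating the line about the origin rotates its normal in place;
        # rotating about an arbitrary point would also need to update d.
        self.theta = (self.theta + angle) % (2 * math.pi)

line = Line(theta=math.pi / 2, d=3.0)  # the horizontal line y = 3
line.rotate(math.pi / 2)               # now the vertical line x = -3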
Which programming language or library is able to process infinite series (like geometric or harmonic ones)? It would presumably need a database of well-known series, automatically giving the proper value in case of convergence, and perhaps raising an exception in case of divergence.
For example, in Python it could look like:
sum = 0
sign = -1.0
for i in range(1, Infinity, 2):
    sign = -sign
    sum += sign / i
then, sum must be math.pi/4 without doing any computations in the loop (because it's a well-known sum).
Most functional languages which evaluate lazily can simulate the processing of infinite series. Of course, on a finite computer it is not possible to process infinite series, as I am sure you are aware. Off the top of my head, I guess Mathematica can do most of what you might want, I suspect that Maple can too, maybe Sage and other computer-algebra systems and I'd be surprised if you can't find a Haskell implementation that suits you.
EDIT to clarify for OP: I do not propose generating infinite loops. Lazy evaluation allows you to write programs (or functions) which simulate infinite series, programs which themselves are finite in time and space. With such languages you can determine many of the properties, such as convergence, of the simulated infinite series with considerable accuracy and some degree of certainty. Try Mathematica or, if you don't have access to it, try Wolfram Alpha to see what one system can do for you.
One place to look might be the Wikipedia category of Computer Algebra Systems.
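In Python, generators give a rough analogue of that lazy simulation; here is a sketch using the asker's Leibniz-style series, materialized only as far as demanded:

from itertools import count, islice

def leibniz_terms():
    # Yields 1, -1/3, 1/5, -1/7, ... lazily, one term at a time.
    for i in count():
        yield (-1) ** i / (2 * i + 1)

partial = sum(islice(leibniz_terms(), 100_000))
print(4 * partial)  # approaches math.pi as more terms are taken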
There are two tools available in Haskell for this beyond simply supporting infinite lists.
First, there is a module that supports looking up sequences in the OEIS. This can be applied to the first few terms of your series and can help you identify a series whose closed form you don't know, etc. The other is the CReal library of computable reals. If you can generate an ever-improving bound on your value (e.g. by summing over a prefix), you can declare it as a computable real number, which admits a partial ordering, etc. In many ways this gives you a value that you can use like the sum above.
However in general computing the equality of two streams requires an oracle for the halting problem, so no language will do what you want in full generality, though some computer algebra systems like Mathematica can try.
Maxima can calculate some infinite sums, but in this particular case it doesn't seem to find the answer :-s
(%i1) sum((-1)^k/(2*k), k, 1, inf), simpsum;
(%o1) ('sum((-1)^k/k, k, 1, inf))/2
but for example, those work:
(%i2) sum(1/(k^2), k, 1, inf), simpsum;
(%o2) %pi^2/6
(%i3) sum((1/2^k), k, 1, inf), simpsum;
(%o3) 1
You can solve the series problem in Sage (a free Python-based math software system) exactly as follows:
sage: k = var('k'); sum((-1)^k/(2*k+1), k, 1, infinity)
1/4*pi - 1
Behind the scenes, this is really using Maxima (a component of Sage).
For Python check out SymPy - a symbolic mathematics library in the spirit of Mathematica.
There is also a heavier Python-based math-processing tool called Sage.
You need something that can do a symbolic computation like Mathematica.
You can also consider querying Wolfram Alpha: sum((-1)^i*1/i, i, 1, inf)
There is a Python library called mpmath, used by sympy (I believe it also backs Sage), which provides support for series.
More specifically, all of the series stuff can be found here: Series documentation
The C++ iRRAM library performs real arithmetic exactly. Among other things it can compute limits exactly using the limit function. The homepage for iRRAM is here. Check out the limit function in the documentation. Note that I'm not talking about arbitrary precision arithmetic. This is exact arithmetic, for a sensible definition of exact. Here's their code to compute e exactly, pulled from the example on their web site:
//---------------------------------------------------------------------
// Compute an approximation to e=2.71.. up to an error of 2^p
REAL e_approx (int p)
{
  if ( p >= 2 ) return 0;
  REAL y = 1, z = 2;
  int i = 2;
  while ( !bound(y, p - 1) ) {
    y = y / i;
    z = z + y;
    i += 1;
  }
  return z;
}
//---------------------------------------------------------------------
// Compute the exact value of e=2.71..
REAL e()
{
  return limit(e_approx);
}
Clojure and Haskell, off the top of my head.
Sorry I couldn't find a better link to Haskell's sequences; if someone else has one, please let me know and I'll update.
Just install sympy on your computer, then run the following code:

from sympy import Sum, oo
from sympy.abc import k

Sum((-1)**k/(2*k+1), (k, 0, oo)).doit()
Result will be: pi/4
I have worked with a couple of huge data series for research purposes, using Matlab. I don't know whether it can process infinite series, but I think there is a possibility. You can try it. :)
This can be done in, for instance, sympy and sage (among open source alternatives). In the following, a few examples using sympy:
In [10]: summation(1/k**2, (k, 1, oo))
Out[10]: pi**2/6

In [11]: summation(1/k**4, (k, 1, oo))
Out[11]: pi**4/90

In [12]: summation((-1)**k/k, (k, 1, oo))
Out[12]: -log(2)

In [13]: summation((-1)**(k+1)/k, (k, 1, oo))
Out[13]: log(2)
Behind the scenes, this uses the theory of hypergeometric series; a nice introduction is the book "A=B" by Marko Petkovšek, Herbert S. Wilf and Doron Zeilberger, which you can find by googling. What is a hypergeometric series?
Everybody knows what a geometric series is: $x_1, x_2, x_3, \dots, x_k, \dots$ is geometric if the consecutive-terms ratio $x_{k+1}/x_k$ is constant. It is hypergeometric if the consecutive-terms ratio is a rational function in $k$! sympy can handle basically all infinite sums where this last condition is fulfilled, but only very few others.
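For instance, the consecutive-terms ratio of the asker's series is rational in k, which is why sympy's summation succeeds; a quick check, assuming sympy is installed:

from sympy import simplify, summation, oo
from sympy.abc import k

term = (-1)**k / (2*k + 1)
print(simplify(term.subs(k, k + 1) / term))  # -(2*k + 1)/(2*k + 3), rational in k
print(summation(term, (k, 0, oo)))           # pi/4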
In attempting to use scipy's quad method to integrate a gaussian (let's say there's a gaussian method named gauss), I was having problems passing needed parameters to gauss and leaving quad to do the integration over the correct variable. Does anyone have a good example of how to use quad with a multidimensional function?
But this led me to a more grand question about the best way to integrate a gaussian in general. I didn't find a gaussian integrate in scipy (to my surprise). My plan was to write a simple gaussian function and pass it to quad (or maybe now a fixed width integrator). What would you do?
Edit: Fixed-width meaning something like trapz that uses a fixed dx to calculate areas under a curve.
What I've come to so far is a make_gauss method that returns a lambda function that can then go into quad. This way I can make a normal function with the average and variance I need before integrating.
import numpy
from numpy import inf
from scipy.integrate import quad

def make_gauss(N, sigma, mu):
    return (lambda x: N / (sigma * (2 * numpy.pi) ** .5) *
            numpy.e ** (-(x - mu) ** 2 / (2 * sigma ** 2)))

quad(make_gauss(N=10, sigma=2, mu=0), -inf, inf)
When I tried passing a general gaussian function (that needs to be called with x, N, mu, and sigma) and filling in some of the values using quad like
quad(gen_gauss, -inf, inf, (10,2,0))
the parameters 10, 2, and 0 did NOT necessarily match N=10, sigma=2, mu=0, which prompted the more extended definition.
The erf(z) in scipy.special would require me to define exactly what t is initially, but it's nice to know it is there.
Okay, you appear to be pretty confused about several things. Let's start at the beginning: you mentioned a "multidimensional function", but then go on to discuss the usual one-variable Gaussian curve. This is not a multidimensional function: when you integrate it, you only integrate one variable (x). The distinction is important to make, because there is a monster called a "multivariate Gaussian distribution" which is a true multidimensional function and, if integrated, requires integrating over two or more variables (which uses the expensive Monte Carlo technique I mentioned before). But you seem to just be talking about the regular one-variable Gaussian, which is much easier to work with, integrate, and all that.
The one-variable Gaussian distribution has two parameters, sigma and mu, and is a function of a single variable we'll denote x. You also appear to be carrying around a normalization parameter n (which is useful in a couple of applications). Normalization parameters are usually not included in calculations, since you can just tack them back on at the end (remember, integration is a linear operator: int(n*f(x), x) = n*int(f(x), x) ). But we can carry it around if you like; the notation I like for a normal distribution is then
N(x | mu, sigma, n) := (n/(sigma*sqrt(2*pi))) * exp((-(x-mu)^2)/(2*sigma^2))
(read that as "the normal distribution of x given sigma, mu, and n is given by...") So far, so good; this matches the function you've got. Notice that the only true variable here is x: the other three parameters are fixed for any particular Gaussian.
Now for a mathematical fact: it is provably true that all Gaussian curves have the same shape, they're just shifted around a little bit. So we can work with N(x|0,1,1), called the "standard normal distribution", and just translate our results back to the general Gaussian curve. So if you have the integral of N(x|0,1,1), you can trivially calculate the integral of any Gaussian. This integral appears so frequently that it has a special name: the error function erf. Because of some old conventions, it's not exactly erf; there are a couple additive and multiplicative factors also being carried around.
If Phi(z) = integral(N(x|0,1,1), -inf, z); that is, Phi(z) is the integral of the standard normal distribution from minus infinity up to z, then it's true by the definition of the error function that
Phi(z) = 0.5 + 0.5 * erf(z / sqrt(2)).
Likewise, if Phi(z | mu, sigma, n) = integral(N(x | mu, sigma, n), -inf, z); that is, Phi(z | mu, sigma, n) is the integral of the normal distribution given parameters mu, sigma, and n from minus infinity up to z, then it's true by the definition of the error function that
Phi(z | mu, sigma, n) = (n/2) * (1 + erf((z - mu) / (sigma * sqrt(2)))).
Take a look at the Wikipedia article on the normal CDF if you want more detail or a proof of this fact.
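A quick numerical check of that identity, using only the standard library's math.erf:

from math import erf, sqrt

def phi(z, mu=0.0, sigma=1.0, n=1.0):
    # Integral of N(x | mu, sigma, n) from -inf to z, via the error function.
    return (n / 2) * (1 + erf((z - mu) / (sigma * sqrt(2))))

print(phi(0.0))   # 0.5: half of the mass lies below the mean
print(phi(1.96))  # ~0.975, the familiar normal-table value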
Okay, that should be enough background explanation. Back to your (edited) post. You say "The erf(z) in scipy.special would require me to define exactly what t is initially". I have no idea what you mean by this; where does t (time?) enter into this at all? Hopefully the explanation above has demystified the error function a bit and it's clearer now as to why the error function is the right function for the job.
Your Python code is OK, but I would prefer a closure over a lambda:
import math

def make_gauss(N, sigma, mu):
    # Precompute the constant factor and the exponent scale once.
    k = N / (sigma * math.sqrt(2 * math.pi))
    s = -1.0 / (2 * sigma * sigma)
    def f(x):
        return k * math.exp(s * (x - mu) * (x - mu))
    return f
Using a closure enables precomputation of constants k and s, so the returned function will need to do less work each time it's called (which can be important if you're integrating it, which means it'll be called many times). Also, I have avoided any use of the exponentiation operator **, which is slower than just writing the squaring out, and hoisted the divide out of the inner loop and replaced it with a multiply. I haven't looked at all at their implementation in Python, but from my last time tuning an inner loop for pure speed using raw x87 assembly, I seem to remember that adds, subtracts, or multiplies take about 4 CPU cycles each, divides about 36, and exponentiation about 200. That was a couple years ago, so take those numbers with a grain of salt; still, it illustrates their relative complexity. As well, calculating exp(x) the brute-force way is a very bad idea; there are tricks you can take when writing a good implementation of exp(x) that make it significantly faster and more accurate than a general a**b style exponentiation.
I've never used the numpy version of the constants pi and e; I've always stuck with the plain old math module's versions. I don't know why you might prefer either one.
I'm not sure what you're going for with the quad() call. quad(gen_gauss, -inf, inf, (10,2,0)) ought to integrate a renormalized Gaussian from minus infinity to plus infinity, and should always spit out 10 (your normalization factor), since the Gaussian integrates to 1 over the real line. Any answer far from 10 (I wouldn't expect exactly 10 since quad() is only an approximation, after all) means something is screwed up somewhere... hard to say what's screwed up without knowing the actual return value and possibly the inner workings of quad().
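A quick sanity check of that expectation, as a sketch assuming the make_gauss closure above and scipy:

from scipy.integrate import quad
from numpy import inf

g = make_gauss(N=10, sigma=2, mu=0)
area, abserr = quad(g, -inf, inf)
print(area)  # ~10.0, the normalization factor, as argued above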
Hopefully that has demystified some of the confusion, and explained why the error function is the right answer to your problem, as well as how to do it all yourself if you're curious. If any of my explanation wasn't clear, I suggest taking a quick look at Wikipedia first; if you still have questions, don't hesitate to ask.
scipy ships with the "error function", aka Gaussian integral:
import scipy.special
help(scipy.special.erf)
The gaussian distribution is also called a normal distribution. The cdf function in scipy's norm module does what you want.

from scipy.stats import norm

print(norm.cdf(0.0))
# 0.5
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html#scipy.stats.norm
Why not just always do your integration from -infinity to +infinity, so that you always know the answer? (joking!)
My guess is that the only reason that there's not already a canned Gaussian function in SciPy is that it's a trivial function to write. Your suggestion about writing your own function and passing it to quad to integrate sounds excellent. It uses the accepted SciPy tool for doing this, it's minimal code effort for you, and it's very readable for other people even if they've never seen SciPy.
What exactly do you mean by a fixed-width integrator? Do you mean using a different algorithm than whatever QUADPACK is using?
Edit: For completeness, here's something like what I'd try for a Gaussian with the mean of 0 and standard deviation of 1 from 0 to +infinity:
from scipy.integrate import quad
from math import pi, exp
from numpy import inf

mean = 0
sd = 1
# Standard normal over [0, inf); the result should be very close to 0.5.
quad(lambda x: 1 / (sd * (2 * pi) ** 0.5) * exp(x ** 2 / (-2 * sd ** 2)), 0, inf)
That's a little ugly because the Gaussian function is a little long, but still pretty trivial to write.
I assume you're handling multivariate Gaussians; if so, SciPy already has the function you're looking for: it's called MVNDIST ("MultiVariate Normal DISTribution"). The SciPy documentation is, as ever, terrible, so I can't even find where the function is buried, but it's in there somewhere. The documentation is easily the worst part of SciPy, and has frustrated me to no end in the past.
Single-variable Gaussians just use the good old error function, of which many implementations are available.
As for attacking the problem in general, yes, as James Thompson mentions, you just want to write your own gaussian distribution function and feed it to quad(). If you can avoid the generalized integration, though, it's a good idea to do so -- specialized integration techniques for a particular function (like MVNDIST uses) are going to be much faster than a standard Monte Carlo multidimensional integration, which can be extremely slow for high accuracy.