Python: looking for a faster, less accurate sqrt() function

I'm looking for a cheaper, less accurate square root function for a high volume of Pythagoras calculations that do not need highly accurate results. Inputs are positive integers, and I can upper-bound the input if necessary. Output to 1 decimal place with accuracy ±0.1 would be good, but I could even get away with output to the nearest integer ±1. Is there anything built into Python that can help with this? Something like math.sqrt() that does fewer approximation steps, perhaps?

As I said in my comment, I do not think you will do much better in speed over math.sqrt in native Python, given its linkage to C's sqrt function. However, your question indicates that you need to perform a lot of "Pythagoras calculations". I am assuming you mean you have a lot of triangles with sides a and b and you want to find the c value for all of them. If so, the following will be quick enough for you. This leverages vectorization with numpy:
import numpy as np
all_as = ... # python list of all of your a values
all_bs = ... # python list of all of your b values
cs = np.sqrt(np.array(all_as)**2 + np.array(all_bs)**2).tolist()
If your use-case is different, then please update your question with the kind of data you have and what operation you want to perform.
However, if you really want a Python implementation of fast square rooting, you can use Newton's method to do this:
def fast_sqrt(y, tolerance=0.05):
    prev = -1.0
    x = 1.0
    while abs(x - prev) > tolerance:  # iterate until within tolerance
        prev = x
        x = x - (x * x - y) / (2 * x)
    return x
However, even with a very high tolerance (0.5 is absurd), you will most likely not beat math.sqrt. Admittedly I have no benchmarks to back this up :) but I can make them for you (or you can do it yourself!)
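For a quick check you could time both with timeit (numbers will of course vary by machine); this is just a rough sketch that assumes fast_sqrt as defined above:

import timeit
# rough comparison of fast_sqrt vs math.sqrt on a single value
print(timeit.timeit("fast_sqrt(12345.0)",
                    setup="from __main__ import fast_sqrt", number=100000))
print(timeit.timeit("math.sqrt(12345.0)",
                    setup="import math", number=100000))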

@modesitt was faster than me :)
Newton's method is the way to go; my contribution is an implementation of Newton's method that is a bit faster than the one modesitt suggested (take sqrt(65) for example: the following method will return after 4 iterations vs fast_sqrt, which will return after 6 iterations).
def sqrt(x):
    delta = 0.1
    runner = x / 2
    while abs(runner - (x / runner)) > delta:
        runner = ((x / runner) + runner) / 2
    return runner
That said, math.sqrt will almost certainly be faster than any implementation that you'll come up with. Let's benchmark the two:
import time
import math

def timeit1():
    s = time.time()
    for i in range(1, 1000000):
        x = sqrt(i)
    print("sqrt took %f seconds" % (time.time() - s))

def timeit2():
    s = time.time()
    for i in range(1, 1000000):
        x = math.sqrt(i)
    print("math.sqrt took %f seconds" % (time.time() - s))

timeit1()
timeit2()
The output that I got on my machine (Macbook pro):
sqrt took 3.229701 seconds
math.sqrt took 0.074377 seconds


Python exponentiation of two lists performance

Is there a way to compute the Cobb-Douglas utility function faster in Python? I run it millions of times, so a speed increase would help. The function raises elements of quantities_list to the power of the corresponding elements of exponents_list, and then multiplies all the resulting elements together.
import time

n = 10
quantities = range(n)
exponents = range(n)

def Cobb_Douglas(quantities_list, exponents_list):
    number_of_variables = len(quantities_list)
    value = 1
    for variable in xrange(number_of_variables):
        value *= quantities_list[variable] ** exponents_list[variable]
    return value

t0 = time.time()
for i in xrange(100000):
    Cobb_Douglas(quantities, exponents)
t1 = time.time()
print t1 - t0
Iterators are your friend. I got a 28% speedup on my computer by switching your loop to this:
for q, e in itertools.izip(quantities_list, exponents_list):
    value *= q ** e
I also got similar results when switching your loop to a functools.reduce call, so it's not worth providing a code sample.
In general, numpy is the right choice for fast arithmetic operations, but numpy's largest integer type is 64 bits, which won't hold the result for your example. If you're using a different numeric range or arithmetic type, numpy is king:
quantities = np.array(quantities, dtype=np.int64)
exponents = np.array(exponents, dtype=np.int64)

def Cobb_Douglas(quantities_list, exponents_list):
    return np.product(np.power(quantities_list, exponents_list))
# result: 2649120435010011136
# actual: 21577941222941856209168026828800000
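For reference, the exact value quoted as "actual" above can be obtained with plain Python integers, which are arbitrary precision (a small sketch; note it uses the original integer lists, not the int64 arrays defined just above):

exact = 1
for q, e in zip(range(10), range(10)):  # i.e. the original quantities, exponents
    exact *= q ** e
print(exact)  # matches the "actual" value quoted above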
Couple of suggestions:
Use Numpy
Vectorize your code
If quantities are large and nothing's going to be zero or negative, work in log-space.
I got about a 15% speedup locally using:
def np_Cobb_Douglas(quantities_list, exponents_list):
    return np.product(np.power(quantities_list, exponents_list))
And about 40% using:
def np_log_Cobb_Douglas(quantities_list, exponents_list):
    # exp(sum(exponents * log(quantities))) == prod(quantities ** exponents)
    return np.exp(np.dot(np.log(quantities_list), exponents_list))
Last but not least, there should be some scaling of your Cobb-Douglas parameters so you don't run into overflow errors (if I'm remembering my intro macro correctly).
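In case it helps, here is one possible way to do that scaling (my own sketch, not from the question): normalise the exponents so they sum to 1 (constant returns to scale), which keeps the product on a scale comparable to the inputs. As above, this assumes strictly positive quantities:

import numpy as np

def scaled_cobb_douglas(quantities, exponents):
    exponents = np.asarray(exponents, dtype=float)
    weights = exponents / exponents.sum()      # rescaled so sum(weights) == 1
    # log-space product: exp(sum(w * log(q))) == prod(q ** w)
    return np.exp(np.dot(np.log(quantities), weights))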

Different implementations of Newton's method in floating point arithmetic

I'm solving a one dimensional non-linear equation with Newton's method. I'm trying to figure out why one of the implementations of Newton's method is converging exactly within floating point precision, whereas another is not.
The following algorithm does not converge:

x_{n+1} = x_n - f(x_n) / f'(x_n)

whereas the following does converge:

x_{n+1} = (x_n * f'(x_n) - f(x_n)) / f'(x_n)
You may assume that the functions f and f' are smooth and well behaved. The best explanation I was able to come up with is that this is somehow related to what's called iterative improvement (Golub and Van Loan, 1989). Any further insight would be greatly appreciated!
Here is a simple python example illustrating the issue
# Python
def f(x):
    return x*x - 2.

def fp(x):
    return 2.*x

xprev = 0.

# converges
x = 1.  # guess
while x != xprev:
    xprev = x
    x = (x*fp(x) - f(x)) / fp(x)
print(x)

# does not converge
x = 1.  # guess
while x != xprev:
    xprev = x
    dx = -f(x) / fp(x)
    x = x + dx
print(x)
Note: I'm aware of how floating point numbers work (please don't post your favourite link to a website telling me to never compare two floating point numbers). Also, I'm not looking for a solution to a problem but for an explanation as to why one of the algorithms converges but not the other.
Update:
As @uhoh pointed out, there are many cases where the second method does not converge. However, I still don't know why the second method converges so much more easily in my real world scenario than the first. All the test cases have very simple functions f, whereas the real world f has several hundred lines of code (which is why I don't want to post it). So maybe the complexity of f is important. If you have any additional insight into this, let me know!
Neither method is perfect:
One situation in which both methods will tend to fail is if the root is almost exactly midway between two consecutive floating-point numbers f1 and f2. Then both methods, having arrived at f1, will try to compute that intermediate value and have a good chance of turning up f2, and vice versa.
                    /f(x)
                   /
                  /
                 /
                /
  f1           /
--+----------------------+------> x
              /          f2
             /
            /
           /
"I'm aware of how floating point numbers work...". Perhaps the workings of floating-point arithmetic are more complicated than imagined.
This is a classic example of cycling of iterates using Newton's method. The comparison of a difference to an epsilon is "mathematical thinking" and can burn you when using floating-point. In your example, you visit several floating-point values for x, and then you are trapped in a cycle between two numbers. The "floating-point thinking" is better formulated as the following (sorry, my preferred language is C++)
std::set<double> visited;
xprev = 0.0;
x = 1.0;
while (x != xprev)
{
    xprev = x;
    dx = -F(x) / DF(x);
    x = x + dx;
    if (visited.find(x) != visited.end())
    {
        break;  // found a cycle
    }
    visited.insert(x);
}
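A rough Python translation of the same cycle-detection idea (my sketch, reusing f and fp from the earlier example):

visited = set()
xprev = 0.0
x = 1.0
while x != xprev:
    xprev = x
    dx = -f(x) / fp(x)
    x = x + dx
    if x in visited:
        break          # we have seen this value before: a cycle, so stop
    visited.add(x)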
I'm trying to figure out why one of the implementations of Newton's method is converging exactly within floating point precision, whereas another is not.
Technically, it doesn't converge to the correct value. Try printing more digits, or using float.hex.
The first one gives
>>> print "%.16f" % x
1.4142135623730949
>>> float.hex(x)
'0x1.6a09e667f3bccp+0'
whereas the correctly rounded value is the next floating point value:
>>> print "%.16f" % math.sqrt(2)
1.4142135623730951
>>> float.hex(math.sqrt(2))
'0x1.6a09e667f3bcdp+0'
The second algorithm is actually alternating between the two values, so doesn't converge.
The problem is due to catastrophic cancellation in f(x): as x*x will be very close to 2, when you subtract 2, the result will be dominated by the rounding error incurred in computing x*x.
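You can see the cancellation directly by evaluating x*x - 2 at the two neighbouring doubles (a quick check you can run yourself):

x1 = 1.4142135623730949   # the value the first method settles on
x2 = 1.4142135623730951   # math.sqrt(2), the correctly rounded root
print(x1 * x1 - 2.)       # a tiny negative number: pure rounding error
print(x2 * x2 - 2.)       # a tiny positive number: pure rounding error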
I think trying to force an exact equal (instead of err < small) is always going to fail frequently. In your example, for 100,000 random numbers between 1 and 10 (instead of your 2.0) the first method fails about 1/3 of the time, the second method about 1/6 of the time. I'll bet there's a way to predict that!
This takes ~30 seconds to run, and the results are cute!:
def f(x, a):
    return x*x - a

def fp(x):
    return 2.*x

def A(a):
    xprev = 0.
    x = 1.
    n = 0
    while x != xprev:
        xprev = x
        x = (x * fp(x) - f(x, a)) / fp(x)
        n += 1
        if n > 100:
            return n, x
    return n, x

def B(a):
    xprev = 0.
    x = 1.
    n = 0
    while x != xprev:
        xprev = x
        dx = -f(x, a) / fp(x)
        x = x + dx
        n += 1
        if n > 100:
            return n, x
    return n, x

import numpy as np
import matplotlib.pyplot as plt

n = 100000
aa = 1. + 9. * np.random.random(n)
data_A = np.zeros((2, n))
data_B = np.zeros((2, n))
for i, a in enumerate(aa):
    data_A[:, i] = A(a)
    data_B[:, i] = B(a)

bins = np.linspace(0, 110, 12)
hist_A = np.histogram(data_A, bins=bins)
hist_B = np.histogram(data_B, bins=bins)
print "A: n<10: ", hist_A[0][0], " n>=100: ", hist_A[0][-1]
print "B: n<10: ", hist_B[0][0], " n>=100: ", hist_B[0][-1]

plt.figure()
plt.subplot(1, 2, 1)
plt.scatter(aa, data_A[0])
plt.subplot(1, 2, 2)
plt.scatter(aa, data_B[0])
plt.show()

Python Pi approximation

So I have to approximate Pi in the following way: 4*(1-1/3+1/5-1/7+1/9-...). It should also be based on the number of iterations. So the function should work like this:
>>> piApprox(1)
4.0
>>> piApprox(10)
3.04183961893
>>> piApprox(300)
3.13825932952
But it works like this:
>>> piApprox(1)
4.0
>>> piApprox(10)
2.8571428571428577
>>> piApprox(300)
2.673322240709928
What am I doing wrong? Here is the code:
def piApprox(num):
    pi = 4.0
    k = 1.0
    est = 1.0
    while 1 < num:
        k += 2
        est = est - (1/k) + 1/(k+2)
        num = num - 1
    return pi * est
This is what you're computing:
4*(1-1/3+1/5-1/5+1/7-1/7+1/9...)
You can fix it just by adding a k += 2 at the end of your loop:
def piApprox(num):
    pi = 4.0
    k = 1.0
    est = 1.0
    while 1 < num:
        k += 2
        est = est - (1/k) + 1/(k+2)
        num = num - 1
        k += 2
    return pi * est
Also, the way you're counting your iterations is wrong, since you're adding two elements at a time.
This is a cleaner version that returns the output that you expect for 10 and 300 iterations:
def approximate_pi(rank):
    value = 0
    for k in xrange(1, 2*rank+1, 2):
        sign = -(k % 4 - 2)
        value += float(sign) / k
    return 4 * value
Here is the same code but more compact:
def approximate_pi(rank):
    return 4 * sum(-float(k % 4 - 2) / k for k in xrange(1, 2*rank+1, 2))
Important edit:
For whoever expects this approximation to quickly yield π, a quote from Wikipedia:
It converges quite slowly, though – after 500,000 terms, it produces
only five correct decimal digits of π
Original answer:
This is an educational example. You try to use a shortcut and attempt to implement the "oscillating" sign of the summands by handling two terms of the series in the same iteration. However, you advance k by only one step (k += 2) per iteration instead of the two steps that would require.
Usually, in math at least, an oscillating sign is achieved with (-1)**i. So, I have chosen this for a more readable implementation:
def pi_approx(num_iterations):
    k = 3.0
    s = 1.0
    for i in range(num_iterations):
        s = s - ((1/k) * (-1)**i)
        k += 2
    return 4 * s
As you can see, I have changed your approach a bit, to improve readability. There is no need for you to check num in a while loop, and there is no particular need for your pi variable. Your est actually is a sum that grows step by step, so why not call it s (sum is a built-in function in Python, so best not to shadow it). Just multiply the sum by 4 at the end, according to your formula.
Test:
>>> pi_approx(100)
3.1514934010709914
The convergence, however, is not especially good:
>>> pi_approx(100) - math.pi
0.009900747481198291
Your expected output looks flaky somehow, because your piApprox(300) (which should be 3.13825932952, according to your post) is too far away from π. How did you come up with that? Is it possibly affected by an accumulated numerical error?
Edit
I would not trust the book too much in regard to what the function should return after 10 and 300 iterations. The intermediate result after 10 steps should indeed be rather free of numerical errors. There, it actually makes a difference whether you take two steps of k at a time or not, so this most likely explains the difference between my pi_approx(10) and the book's. For 300 iterations, numerical error might have severely affected the result in the book. If this is an old book, and they implemented their example in C, possibly using single precision, then a significant portion of the result may be due to accumulated numerical error (note: repeatedly summing small values into a large one like this is a prime example of how badly you can be affected by numerical errors; it does not get much worse than that!).
What counts is that you have looked at the math (the formula for PI), and you have implemented a working Python version of approximating that formula. That was the learning goal of the book, so go ahead and tackle the next problem :-).
def piApprox(num):
    pi = 4.0
    k = 3.0
    est = 1.0
    while 1 < num:
        est = est - (1/k) + 1/(k+2)
        num = num - 1
        k += 4
    return pi * est
Also, for real tasks, use math.pi.
Here is a slightly simpler version:
def pi_approx(num_terms):
    sign = 1.      # +1. or -1.
    pi_by_4 = 1.   # first term
    for div in range(3, 2 * num_terms, 2):  # 3, 5, 7, ...
        sign = -sign            # flip sign
        pi_by_4 += sign / div   # add next term
    return 4. * pi_by_4
which gives
>>> for n in [1, 10, 300, 1000, 3000]:
... print(pi_approx(n))
4.0
3.0418396189294032
3.1382593295155914
3.140592653839794
3.1412593202657186
While all of these answers are perfectly good approximations, if you are using the Madhava-Leibniz series then you should arrive at "an approximation of π correct to 11 decimal places as 3.14159265359" within the first 21 terms, according to Wikipedia: https://en.wikipedia.org/wiki/Approximations_of_%CF%80
Therefore, a more accurate solution could be any variation of this:
import math

def estimate_pi(terms):
    ans = 0.0
    for k in range(terms):
        ans += (-1.0/3.0)**k / (2.0*k + 1.0)
    return math.sqrt(12) * ans

print(estimate_pi(21))
Output: 3.141592653595635

Why is pow(a, d, n) so much faster than a**d % n?

I was trying to implement a Miller-Rabin primality test, and was puzzled why it was taking so long (> 20 seconds) for midsize numbers (~7 digits). I eventually found the following line of code to be the source of the problem:
x = a**d % n
(where a, d, and n are all similar, but unequal, midsize numbers, ** is the exponentiation operator, and % is the modulo operator)
I then tried replacing it with the following:
x = pow(a, d, n)
and by comparison it is almost instantaneous.
For context, here is the original function:
from random import randint

def primalityTest(n, k):
    if n < 2:
        return False
    if n % 2 == 0:
        return False
    s = 0
    d = n - 1
    while d % 2 == 0:
        s += 1
        d >>= 1
    for i in range(k):
        rand = randint(2, n - 2)
        x = rand**d % n  # offending line
        if x == 1 or x == n - 1:
            continue
        for r in range(s):
            toReturn = True
            x = pow(x, 2, n)
            if x == 1:
                return False
            if x == n - 1:
                toReturn = False
                break
        if toReturn:
            return False
    return True
print(primalityTest(2700643,1))
An example timed calculation:
from timeit import timeit

a = 2505626
d = 1520321
n = 2700643

def testA():
    print(a**d % n)

def testB():
    print(pow(a, d, n))

print("time: %(time)fs" % {"time": timeit("testA()", setup="from __main__ import testA", number=1)})
print("time: %(time)fs" % {"time": timeit("testB()", setup="from __main__ import testB", number=1)})
Output (run with PyPy 1.9.0):
2642565
time: 23.785543s
2642565
time: 0.000030s
Output (run with Python 3.3.0, 2.7.2 returns very similar times):
2642565
time: 14.426975s
2642565
time: 0.000021s
And a related question, why is this calculation almost twice as fast when run with Python 2 or 3 than with PyPy, when usually PyPy is much faster?
See the Wikipedia article on modular exponentiation. Basically, when you do a**d % n, you actually have to calculate a**d, which could be quite large. But there are ways of computing a**d % n without having to compute a**d itself, and that is what pow does. The ** operator can't do this because it can't "see into the future" to know that you are going to immediately take the modulus.
BrenBarn answered your main question. For your aside:
why is it almost twice as fast when run with Python 2 or 3 than PyPy, when usually PyPy is much faster?
If you read PyPy's performance page, this is exactly the kind of thing PyPy is not good at—in fact, the very first example they give:
Bad examples include doing computations with large longs – which is performed by unoptimizable support code.
Theoretically, turning a huge exponentiation followed by a mod into a modular exponentiation (at least after the first pass) is a transformation a JIT might be able to make… but not PyPy's JIT.
As a side note, if you need to do calculations with huge integers, you may want to look at third-party modules like gmpy, which can be much faster than CPython's native implementation for some cases outside the mainstream uses, and which also has a lot of additional functionality that you'd otherwise have to write yourself, at the cost of being less convenient.
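For example, with gmpy's successor gmpy2 (assuming it is installed; this is just a sketch of the idea):

import gmpy2
# modular exponentiation on GMP integers; gives the same result as pow(a, d, n)
print(gmpy2.powmod(2505626, 1520321, 2700643))   # 2642565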
There are shortcuts to doing modular exponentiation: for instance, you can find a**(2**i) mod n for every i up to log2(d) and multiply together (mod n) the intermediate results you need. A dedicated modular-exponentiation function like 3-argument pow() can leverage such tricks because it knows you're doing modular arithmetic. The Python parser can't recognize this given the bare expression a**d % n, so it will perform the full calculation (which will take much longer).
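As a toy illustration of that idea (plain square-and-multiply with a reduction mod n at every step; not how CPython's built-in pow is literally implemented, just the same principle):

def modexp(a, d, n):
    result = 1
    base = a % n
    while d > 0:
        if d & 1:                      # this bit of the exponent is set
            result = (result * base) % n
        base = (base * base) % n       # square, reducing mod n to stay small
        d >>= 1
    return result

# modexp(2505626, 1520321, 2700643) == pow(2505626, 1520321, 2700643)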
The way x = a**d % n is calculated is to raise a to the d power, then take that modulo n. Firstly, if a is large, this creates a huge number which is then truncated. However, x = pow(a, d, n) is most likely optimized so that only the value modulo n is tracked throughout the computation, which is all that is required for multiplication modulo a number.

python: Generating integer partitions

I need to generate all the partitions of a given integer.
I found this algorithm by Jerome Kelleher for which it is stated to be the most efficient one:
def accelAsc(n):
    a = [0 for i in range(n + 1)]
    k = 1
    a[0] = 0
    y = n - 1
    while k != 0:
        x = a[k - 1] + 1
        k -= 1
        while 2*x <= y:
            a[k] = x
            y -= x
            k += 1
        l = k + 1
        while x <= y:
            a[k] = x
            a[l] = y
            yield a[:k + 2]
            x += 1
            y -= 1
        a[k] = x + y
        y = x + y - 1
        yield a[:k + 1]
reference: http://homepages.ed.ac.uk/jkellehe/partitions.php
By the way, it is not that efficient in practice: for an input like 40 it nearly freezes my whole system for a few seconds before giving its output.
If it was a recursive algorithm I would try to decorate it with a caching function or something to improve its efficiency, but being like that I can't figure out what to do.
Do you have some suggestions about how to speed up this algorithm? Or can you suggest another one, or a different approach to build one from scratch?
To generate compositions directly you can use the following algorithm:
def ruleGen(n, m, sigma):
    """
    Generates all interpart restricted compositions of n with first part
    >= m using restriction function sigma. See Kelleher 2006, 'Encoding
    partitions as ascending compositions' chapters 3 and 4 for details.
    """
    a = [0 for i in range(n + 1)]
    k = 1
    a[0] = m - 1
    a[1] = n - m + 1
    while k != 0:
        x = a[k - 1] + 1
        y = a[k] - 1
        k -= 1
        while sigma(x) <= y:
            a[k] = x
            x = sigma(x)
            y -= x
            k += 1
        a[k] = x + y
        yield a[:k + 1]
This algorithm is very general, and can generate partitions and compositions of many different types. For your case, use
ruleGen(n, 1, lambda x: 1)
to generate all unrestricted compositions. The third argument is known as the restriction function, and describes the type of composition/partition that you require. The method is efficient, as the amount of effort required to generate each composition is constant, when you average over all the compositions generated. If you would like to make it slightly faster in python then it's easy to replace the function sigma with 1.
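For example (a quick sanity check; the partition call uses sigma(x) = x, which is discussed further down):

print(list(ruleGen(3, 1, lambda x: 1)))
# [[1, 1, 1], [1, 2], [2, 1], [3]]                 -- all compositions of 3
print(list(ruleGen(4, 1, lambda x: x)))
# [[1, 1, 1, 1], [1, 1, 2], [1, 3], [2, 2], [4]]   -- the partitions of 4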
It's worth noting here as well that for any constant amortised time algorithm, what you actually do with the generated objects will almost certainly dominate the cost of generating them. For example, if you store all the partitions in a list, then the time spent managing the memory for this big list will be far greater than the time spent generating the partitions.
Say, for some reason, you want to take the product of each partition. If you take a naive approach to this, then the processing involved is linear in the number of parts, whereas the cost of generation is constant. It's quite difficult to think of an application of a combinatorial generation algorithm in which the processing doesn't dominate the cost of generation. So, in practice, there'll be no measurable difference between using the simpler and more general ruleGen with sigma(x) = x and the specialised accelAsc.
If you are going to use this function repeatedly for the same inputs, it could still be worth caching the return values (if you are going to use it across separate runs, you could store the results in a file).
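A minimal sketch of that caching idea, assuming Python 3's functools.lru_cache (for Python 2, the functools32 backport provides the same decorator):

from functools import lru_cache

@lru_cache(maxsize=None)
def cached_partitions(n):
    # materialise the generator once per n; repeated calls reuse the stored list
    return list(accelAsc(n))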
If you can't find a significantly faster algorithm, then it should be possible to speed this up by an order of magnitude or two by moving the code into a C extension (this is probably easiest using cython), or alternatively by using PyPy instead of CPython (PyPy has its downsides - it does not yet support Python 3, or some commonly-used libraries like numpy and scipy).
The reason for this is that, since Python is dynamically typed, the interpreter is probably spending most of its time checking the types of the variables - for all the interpreter knows, one of the operations could turn x into a string, in which case expressions like x + y would suddenly have very different meanings. Cython gets around this problem by allowing you to statically declare the variables as integers, while PyPy has a just-in-time compiler which minimises redundant type checks.
Testing with n=75 I get:
PyPy 1.8:
w:\>c:\pypy-1.8\pypy.exe pstst.py
1.04800009727 secs.
CPython 2.6:
w:\>python pstst.py
5.86199998856 secs.
Cython + mingw + gcc 4.6.2:
w:\pstst> python -c "import pstst;pstst.run()"
4.06399989128
I saw no difference with Psyco(?)
The run function:
def run():
    import time
    start = time.time()
    for p in accelAsc(75):
        pass
    print time.time() - start, 'secs.'
If I change the definition of accelAsc for Cython to start with:
def accelAsc(int n):
    cdef int x, y, k
    # no more changes..
I get the Cython time down to 2.27 secs.
I'd say that your performance issue is somewhere else.
I didn't compare it with other approaches, but it does seem efficient to me:
import time

start = time.time()
partitions = list(accelAsc(40))
print('time: {:.5f} sec'.format(time.time() - start))
print('length:', len(partitions))
Gave:
time: 0.03636 sec
length: 37338
