I am trying to implement the Karatsuba multiplication algorithm in C++, but right now I am just trying to get it to work in Python.
Here is my code:
def mult(x, y, b, m):
if max(x, y) < b:
return x * y
bm = pow(b, m)
x0 = x / bm
x1 = x % bm
y0 = y / bm
y1 = y % bm
z2 = mult(x1, y1, b, m)
z0 = mult(x0, y0, b, m)
z1 = mult(x1 + x0, y1 + y0, b, m) - z2 - z0
return mult(z2, bm ** 2, b, m) + mult(z1, bm, b, m) + z0
What I don't get is: how should z2, z1, and z0 be created? Is using the mult function recursively correct? If so, I'm messing up somewhere because the recursion isn't stopping.
Can someone point out where the error is?
NB: the response below addresses directly the OP's question about
excessive recursion, but it does not attempt to provide a correct
Karatsuba algorithm. The other responses are far more informative in
this regard.
Try this version:
def mult(x, y, b, m):
    bm = pow(b, m)
    if min(x, y) <= bm:
        return x * y
    # NOTE the following 4 lines
    x0 = x % bm
    x1 = x // bm  # floor division, so the split stays an integer
    y0 = y % bm
    y1 = y // bm
    z0 = mult(x0, y0, b, m)
    z2 = mult(x1, y1, b, m)
    z1 = mult(x1 + x0, y1 + y0, b, m) - z2 - z0
    retval = mult(mult(z2, bm, b, m) + z1, bm, b, m) + z0
    assert retval == x * y, "%d * %d == %d != %d" % (x, y, x * y, retval)
    return retval
The most serious problem with your version is that your calculations of x0 and x1, and of y0 and y1 are flipped. Also, the algorithm's derivation does not hold if x1 and y1 are 0, because in this case, a factorization step becomes invalid. Therefore, you must avoid this possibility by ensuring that both x and y are greater than b**m.
EDIT: fixed a typo in the code; added clarifications
EDIT2:
To be clearer, commenting directly on your original version:
def mult(x, y, b, m):
# The termination condition will never be true when the recursive
# call is either
# mult(z2, bm ** 2, b, m)
# or mult(z1, bm, b, m)
#
# Since every recursive call leads to one of the above, you have an
# infinite recursion condition.
if max(x, y) < b:
return x * y
bm = pow(b, m)
# Even without the recursion problem, the next four lines are wrong
x0 = x / bm # RHS should be x % bm
x1 = x % bm # RHS should be x / bm
y0 = y / bm # RHS should be y % bm
y1 = y % bm # RHS should be y / bm
z2 = mult(x1, y1, b, m)
z0 = mult(x0, y0, b, m)
z1 = mult(x1 + x0, y1 + y0, b, m) - z2 - z0
return mult(z2, bm ** 2, b, m) + mult(z1, bm, b, m) + z0
Usually, big numbers are stored as arrays of integers, where each integer represents one digit. This approach lets you multiply any number by a power of the base with a simple left shift of the array.
Here is my list-based implementation (may contain bugs):
import operator

def normalize(l, b):
    # Propagate carries (or borrows) so that every digit ends up in [0, b)
    over = 0
    for i, x in enumerate(l):
        over, l[i] = divmod(x + over, b)
    if over: l.append(over)
    return l

def sum_lists(x, y, b):
    l = min(len(x), len(y))
    # list(...) is needed on Python 3, where map returns an iterator
    res = list(map(operator.add, x[:l], y[:l]))
    if len(x) > l: res.extend(x[l:])
    else: res.extend(y[l:])
    return normalize(res, b)

def sub_lists(x, y, b):
    res = list(map(operator.sub, x[:len(y)], y))
    res.extend(x[len(y):])
    return normalize(res, b)

def lshift(x, n):
    # Multiply by b**n: prepend n zero digits (least significant digit first)
    if len(x) > 1 or len(x) == 1 and x[0] != 0:
        return [0 for i in range(n)] + x
    else: return x

def mult_lists(x, y, b):
    if min(len(x), len(y)) == 0: return [0]
    m = max(len(x), len(y))
    if (m == 1): return normalize([x[0]*y[0]], b)
    else: m >>= 1
    x0, x1 = x[:m], x[m:]
    y0, y1 = y[:m], y[m:]
    z0 = mult_lists(x0, y0, b)
    z1 = mult_lists(x1, y1, b)
    z2 = mult_lists(sum_lists(x0, x1, b), sum_lists(y0, y1, b), b)
    t1 = lshift(sub_lists(z2, sum_lists(z1, z0, b), b), m)
    t2 = lshift(z1, m*2)
    return sum_lists(sum_lists(z0, t1, b), t2, b)
The element-wise additions and subtractions inside sum_lists and sub_lists can produce unnormalized results, where a single digit is negative or not smaller than the base. The normalize function solves this problem.
All functions expect a list of digits in reverse order. For example, 12 in base 10 should be written as [2,1]. Let's take the square of 987654321.
» a = [1,2,3,4,5,6,7,8,9]
» res = mult_lists(a,a,10)
» res.reverse()
» res
[9, 7, 5, 4, 6, 1, 0, 5, 7, 7, 8, 9, 9, 7, 1, 0, 4, 1]
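To check results like the one above against Python's built-in integers, a pair of small conversion helpers is handy. This is only a minimal sketch: the names to_digits and from_digits are made up for illustration and are not part of the implementation above.

```python
def to_digits(n, b):
    """Return the digits of non-negative n in base b, least significant first."""
    digits = []
    while n:
        n, d = divmod(n, b)
        digits.append(d)
    return digits or [0]


def from_digits(digits, b):
    """Inverse of to_digits: rebuild the integer from reversed digits."""
    n = 0
    for d in reversed(digits):
        n = n * b + d
    return n


a = to_digits(987654321, 10)
print(a)                        # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(from_digits(a, 10) ** 2)  # 975461057789971041
```

Comparing from_digits(mult_lists(a, a, 10), 10) with the built-in product is then a one-line check.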
The goal of the Karatsuba multiplication is to improve on the divide-and-conquer multiplication algorithm by making three recursive calls instead of four. Therefore, the only lines in your script that should contain a recursive call to the multiplication are those assigning z0, z1 and z2. Anything else will give you a worse complexity. You can't use pow to compute bm when you haven't defined multiplication yet (and a fortiori exponentiation), either.
For that, the algorithm crucially uses the fact that it is using a positional notation system. If you have a representation x of a number in base b, then x*b^m is simply obtained by shifting the digits of that representation m times to the left. That shifting operation is essentially "free" with any positional notation system. That also means that if you want to implement that, you have to reproduce this positional notation, and the "free" shift. Either you choose to compute in base b=2 and use Python's bit operators (or the bit operators of a given decimal, hex, ... base if your test platform has them), or you decide to implement for educational purposes something that works for an arbitrary b, and you reproduce this positional arithmetic with something like strings, arrays, or lists.
You have a solution with lists already. I like to work with strings in Python, since int(s, base) will give you the integer corresponding to the string s seen as a number representation in base base: it makes tests easy. I have posted a heavily commented string-based implementation as a gist here, including string-to-number and number-to-string primitives for good measure.
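As a small illustration of the "free" shift (a sketch with arbitrary example values, not taken from any of the implementations discussed here): in base 2 a multiplication by b^m is a bit shift, and for a digit string in any base it amounts to appending m zeros.

```python
x, m = 123, 4

# Base 2: multiplying by 2**m is a left shift by m bits.
assert x << m == x * 2**m

# Digit string in base b (int() accepts bases 2..36): append m zero digits.
s, b = "1f3", 16
assert int(s + "0" * m, b) == int(s, b) * b**m
```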
You can test it by providing padded strings with the base and their (equal) length as arguments to mult:
In [169]: mult("987654321","987654321",10,9)
Out[169]: '975461057789971041'
If you don't want to figure out the padding or count string lengths, a padding function can do it for you:
In [170]: padding("987654321","2")
Out[170]: ('987654321', '000000002', 9)
And of course it works with b>10:
In [171]: mult('987654321', '000000002', 16, 9)
Out[171]: '130eca8642'
(Check with wolfram alpha)
I believe that the idea behind the technique is that the z_i terms are computed using the recursive algorithm, but the results are not combined that way. With the names used in your code, the net result that you want is
z2 B^2m + z1 B^m + z0
Assuming that you choose a suitable value of B (say, 2), you can compute the products with B^m and B^2m without doing any multiplications. For example, when using B = 2, you can use bit shifts rather than multiplications. This means that the last step can be done without doing any multiplications at all.
One more thing - I noticed that you've picked a fixed value of m for the whole algorithm. Typically, you would implement this algorithm by having m always be about half the number of digits in x and y when they are written in base B. If you're using B = 2, this would be done by picking m = ceil((log2 x) / 2).
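Putting these two points together, a minimal sketch for B = 2 could look like the following. It assumes non-negative Python integers, picks m from the bit length on every call, and uses an arbitrary cutoff of 16 below which it falls back to the built-in product; the function name and the cutoff are choices made for this illustration only.

```python
def karatsuba(x, y):
    """Multiply two non-negative integers using three recursive calls."""
    if x < 16 or y < 16:
        return x * y                      # small operands: base case
    # m is about half the bit length of the larger operand
    m = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << m) - 1
    x1, x0 = x >> m, x & mask             # x = x1 * 2**m + x0
    y1, y0 = y >> m, y & mask             # y = y1 * 2**m + y0
    z2 = karatsuba(x1, y1)
    z0 = karatsuba(x0, y0)
    z1 = karatsuba(x1 + x0, y1 + y0) - z2 - z0
    # Recombine with shifts only: z2 * 2**(2m) + z1 * 2**m + z0
    return (z2 << (2 * m)) + (z1 << m) + z0


print(karatsuba(987654321, 987654321))    # 975461057789971041
```

The only recursive calls are the three that compute z2, z0 and z1; the recombination uses shifts and additions.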
Hope this helps!
In Python 2.7: Save this file as Karatsuba.py
def karatsuba(x,y):
    """Karatsuba multiplication algorithm.
    Return the product of two numbers in an efficient manner
    @author Shashank
    date: 23-09-2018
    Parameters
    ----------
    x : int
        First Number
    y : int
        Second Number
    Returns
    -------
    prod : int
        The product of two numbers
    Examples
    --------
    >>> import Karatsuba
    >>> a = 1234567899876543211234567899876543211234567899876543211234567890
    >>> b = 9876543211234567899876543211234567899876543211234567899876543210
    >>> Karatsuba.karatsuba(a,b)
    12193263210333790590595945731931108068998628253528425547401310676055479323014784354458161844612101832860844366209419311263526900
    """
    if len(str(x)) == 1 or len(str(y)) == 1:
        return x*y
    else:
        n = max(len(str(x)), len(str(y)))
        m = n//2  # floor division keeps m an integer
        a = x//10**m
        b = x%10**m
        c = y//10**m
        d = y%10**m
        ac = karatsuba(a,c) #step 1
        bd = karatsuba(b,d) #step 2
        ad_plus_bc = karatsuba(a+b, c+d) - ac - bd #step 3
        prod = ac*10**(2*m) + bd + ad_plus_bc*10**m #step 4
        return prod
To do some simulations in Python, I'm trying to generate numbers a,b,c such that a^2 + b^2 + c^2 = 1. I think generating some a between 0 and 1, then some b between 0 and sqrt(1 - a^2), and then c = sqrt(1 - a^2 - b^2) would work.
Floating point values are fine, the sum of squares should be close to 1. I want to keep generating them for some iterations.
Being new to Python, I'm not really sure how to do this. Negatives are allowed.
Edit: Thanks a lot for the answers!
According to this answer at stats.stackexchange.com, you should use normally distributed values to get uniformly distributed values on a sphere. That would mean, you could do:
import numpy as np
abc = np.random.normal(size=3)
a,b,c = abc/np.sqrt(sum(abc**2))
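The same idea also works with the standard library alone. The following is a sketch under that assumption: the name random_unit_vector is made up here, and random.gauss stands in for np.random.normal.

```python
import math
import random


def random_unit_vector():
    """Return (a, b, c) with a**2 + b**2 + c**2 == 1 up to rounding."""
    while True:
        a, b, c = (random.gauss(0.0, 1.0) for _ in range(3))
        norm = math.sqrt(a * a + b * b + c * c)
        if norm > 0.0:  # guard against the (practically impossible) zero vector
            return a / norm, b / norm, c / norm


a, b, c = random_unit_vector()
print(a * a + b * b + c * c)
```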
Just in case you're interested in the probability densities, I decided to do a comparison between the different approaches:
import numpy as np
import random
import math
def MSeifert():
    a = 1
    b = 1
    while a**2 + b**2 > 1: # discard any a and b whose sum of squares already exceeds 1
        a = random.random()
        b = random.random()
    c = math.sqrt(1 - a**2 - b**2) # fixed c
    return a, b, c
def VBB():
    x = np.random.uniform(0,1,3) # random numbers in [0, 1)
    x /= np.sqrt(x[0] ** 2 + x[1] ** 2 + x[2] ** 2)
    return x[0], x[1], x[2]
def user3684792():
    theta = random.uniform(0, 0.5*np.pi)
    phi = random.uniform(0, 0.5*np.pi)
    return np.sin(theta)* np.cos(phi), np.sin(theta)*np.sin(phi), np.cos(theta)
def JohanL():
    abc = np.random.normal(size=3)
    a,b,c = abc/np.sqrt(sum(abc**2))
    return a, b, c
def SeverinPappadeux():
    cos_th = 2.0*random.uniform(0, 1.0) - 1.0
    sin_th = math.sqrt(1.0 - cos_th*cos_th)
    phi = random.uniform(0, 2.0*math.pi)
    return sin_th * math.cos(phi), sin_th * math.sin(phi), cos_th
And plotting the distributions:
%matplotlib notebook
import matplotlib.pyplot as plt
f, axes = plt.subplots(3, 4)
for func_idx, func in enumerate([MSeifert, JohanL, user3684792, VBB]):
    axes[0, func_idx].set_title(str(func.__name__))
    res = [func() for _ in range(50000)]
    for idx in range(3):
        axes[idx, func_idx].hist([i[idx] for i in res], bins='auto')
axes[0, 0].set_ylabel('a')
axes[1, 0].set_ylabel('b')
axes[2, 0].set_ylabel('c')
plt.tight_layout()
With the result:
Explanation: The rows show the distributions for a, b and c respectively while the columns show the histograms (distributions) of the different approaches.
The only approaches that give a uniformly random distribution in the range (-1, 1) are JohanL's and Severin Pappadeux's. All other approaches have some features like spikes or a functional behavior in the range [0, 1). Note that these two solutions currently give values between -1 and 1, while all other approaches give values between 0 and 1.
I think it is actually a cool problem, and a nice way to do this is to just use spherical polar coordinates and generate the angles at random.
import random
import numpy as np
def random_pt():
    theta = random.uniform(0, 0.5*np.pi)
    phi = random.uniform(0, 0.5*np.pi)
    return np.sin(theta)* np.cos(phi), np.sin(theta)*np.sin(phi), np.cos(theta)
You could do it like this:
import random
import math
def three_random_numbers_adding_to_one():
    a = 1
    b = 1
    while a**2 + b**2 > 1: # discard any a and b whose sum of squares already exceeds 1
        a = random.random()
        b = random.random()
    c = math.sqrt(1 - a**2 - b**2) # fixed c
    return a, b, c
a, b, c = three_random_numbers_adding_to_one()
print(a**2 + b**2 + c**2)
However floats have only limited precision so these won't add to exactly 1, just approximately.
You may need to check if the numbers generated with this function are "random enough". It could be that this setup biases the "randomness".
The "right" answer depends on whether you are looking for a uniform random distribution in space, or on the surface of a sphere, or something else. If you are looking for points on the surface of a sphere, you still have to worry about the cos(theta) factor which will cause points to appear "bunched up" near the poles of the sphere. Since exact nature is not clear from your question, here is a "totally random" distribution that should work:
x = np.random.uniform(0,1,3) # random numbers in [0, 1)
x /= np.sqrt(x[0] ** 2 + x[1] ** 2 + x[2] ** 2)
Another advantage here is that since we are using numpy arrays, you can quickly scale to large sets of points too, by using x = np.random.uniform(0, 1, (3, n)) for any n.
Time to add another solution, heh...
This time it is truly uniform on the unit sphere point picking - check http://mathworld.wolfram.com/SpherePointPicking.html for details
import math
import random
def random_pt():
    cos_th = 2.0*random.uniform(0, 1.0) - 1.0
    sin_th = math.sqrt(1.0 - cos_th*cos_th)
    phi = random.uniform(0, 2.0*math.pi)
    return sin_th * math.cos(phi), sin_th * math.sin(phi), cos_th
for k in range(0, 100):
    a, b, c = random_pt()
    print("{0} {1} {2} {3}".format(a, b, c, a*a + b*b + c*c))
The program needs to compute a definite integral with a predetermined accuracy (eps) using the trapezoidal rule, and my function needs to return:
1. the approximate value of the integral.
2. the number of iterations.
My code:
from math import *
def f1(x):
return (x ** 2 - 1)**(-0.5)
def f2(x):
return (cos(x)/(x + 1))
def integral(f,a,b,eps):
n = 2
x = a
h = (b - a) / n
sum = 0.5 * (f(a) + f(b))
for i in range(n):
sum = sum + f(a + i * h)
sum_2 = h * sum
k = 0
flag = 1
while flag == 1:
n = n * 2
sum = 0
k = k + 1
x = a
h = (b - a) / n
sum = 0.5 * (f(a) + f(b))
for i in range(n):
sum = sum + f(a + i * h)
sum_new = h * sum
if eps > abs(sum_new - sum_2):
t1 = sum_new
t2 = k
return t1, t2
else:
sum_2 = sum_new
x1 = float(input("First-begin: "))
x2 = float(input("First-end: "))
y1 = float(input("Second-begin: "))
y2 = float(input("Second-end: "))
int_1 = integral(f1,x1,y1,1e-6)
int_2 = integral(f2,x2,y2,1e-6)
print(int_1)
print(int_2)
It doesn't work correctly. Help, please!
You implemented the math wrong. The error is in the lines
for i in range(n):
    sum = sum + f(a + i * h)
range(n) always starts at 0, so in your first iteration you just add the f(a) term again.
If you replace it with
for i in range(1, n):
    sum = sum + f(a + i * h)
it works.
Also, you have a ton of redundant code; you basically coded the core of the integration algorithm twice. Try to follow the DRY-principle.
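One way to remove the duplication is sketched below. It keeps the stopping criterion from the question (the difference between two successive estimates drops below eps) and moves the composite rule into a helper; the helper name trapezoid is made up for this illustration.

```python
def trapezoid(f, a, b, n):
    """Composite trapezoidal rule on n equal sub-intervals of [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):          # interior points only
        total += f(a + i * h)
    return h * total


def integral(f, a, b, eps):
    """Double n until two successive estimates differ by less than eps."""
    n = 2
    previous = trapezoid(f, a, b, n)
    iterations = 0
    while True:
        n *= 2
        iterations += 1
        current = trapezoid(f, a, b, n)
        if abs(current - previous) < eps:
            return current, iterations
        previous = current


value, iterations = integral(lambda x: x * x, 0.0, 1.0, 1e-6)
print(value, iterations)
```

Note that this version still re-evaluates f at points it has already visited each time n is doubled.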
The trapezoidal rule of integration simply says that an approximation to the integral $\int_a^b f(x) dx$ is (b-a) (f(a)+f(b))/2. The error is proportional to (b-a)^3, so it is possible to get a better estimate using the composite rule, i.e., subdividing the initial interval into a number of shorter intervals.
Is it possible to use shorter intervals and still reuse the function values previously computed, so minimizing the total number of function evaluations?
Yes, it is possible if we divide each interval into two equal parts, so that at stage 0 we use 1 interval, at stage 1 2 equal intervals and in general, at stage n, we use 2^n equal intervals.
Let's start with a simple problem and see if it is possible to generalize the procedure…
a, b = 0, 32
L = b-a = 32
by the trapezoidal rule the initial approximation say I0, is given by
I0 = L * (f0+f1)/2
= L * S0
with S0 = (f0+f1)/2; a pictorial representation of the real axis, the coordinates of the interval extremes and the evaluated functions follows
x0                              x1
012345678901234567890123456789012
f0                              f1
Next, we divide the original interval in two,
L = L/2
x0              x2              x1
012345678901234567890123456789012
f0              f2              f1
and the new approximation, stage n=1, is obtained using two times the trapezoidal rule and applying a bit of algebra
I1 = L * (f0+f2)/2 + L * (f2+f1)/2
= L * [(f0+f1)/2 + f2]
= L * [S0 + S1]
with S1 = f2
Another subdivision, stage n=2, L = L/2 and
x0      x3      x2      x4      x1
012345678901234567890123456789012
f0      f3      f2      f4      f1
I2 = L * [(f0+f3) + (f3+f2) + (f2+f4) + (f4+f1)] / 2
= L * [(f0+f1)/2 + f2 + (f3+f4)]
= L * [S0+S1+S2]
with S2 = f3 + f4.
It is not difficult, given this picture,
x0  x5  x3  x6  x2  x7  x4  x8  x1
012345678901234567890123456789012
f0  f5  f3  f6  f2  f7  f4  f8  f1
to understand that our next approximation can be computed as follows
L = L/2
S3 = f5+f6+f7+f8
I3 = L*[S0+S1+S2+S3]
Now we have to work out a general formula for Sn, n = 1, …; the pseudocode is
L_n = (b-a)/2**n
list_x_n = list(a + L_n + 2*L_n*j for j = 0, …, 2**(n-1) - 1)
Sn = Sum(f(xj) for each xj in list_x_n)
For n = 3, L_n = (b-a)/8 = 4, and the formula above gives list_x_n = [4, 12, 20, 28]; please check against the picture.
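The new abscissae of stage n are a + L_n + 2*L_n*j for j = 0, …, 2^(n-1) - 1, i.e. the midpoints of the intervals of the previous stage. A quick way to check this against the pictures is the following sketch (the helper name new_points is made up for this illustration):

```python
def new_points(a, b, n):
    """Abscissae that stage n adds to the ones already used by stage n-1."""
    L_n = (b - a) / 2**n
    return [a + L_n + 2 * L_n * j for j in range(2**(n - 1))]


for n in (1, 2, 3):
    print(n, new_points(0, 32, n))
# 1 [16.0]
# 2 [8.0, 24.0]
# 3 [4.0, 12.0, 20.0, 28.0]
```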
Now we are ready to code our algorithm in Python
def trapaezia(f, a, b, tol):
    "returns integ(f, (a,b)), estimated error and number of evaluations"
    from math import fsum # controls accumulation of rounding errors in sums
    L = b - a
    S = (f(a)+f(b))/2
    I = L*S
    n = 1
    while True:
        L = L/2
        new_points = (a+L+j*L for j in range(0, n+n, 2))
        delta_S = fsum(f(x) for x in new_points)
        new_S = S + delta_S
        new_I = L*new_S
        # error is estimated using Richardson extrapolation (REP)
        err = (new_I - I) * 4/3
        if abs(err) > tol:
            n = n+n
            S, I = new_S, new_I
        else:
            # we return a better estimate using again REP
            return (4*new_I-I)/3, err, n+n+1
If you are curious about Richardson extrapolation, I recommend this document that deals exactly with the application of REP to the trapezoidal rule quadrature algorithm.
If you are curious about math.fsum, the docs don't say too much, but they link to the original implementation, which also includes an extended explanation of all the issues involved.
My textbook gives the pseudo code for Horner's method as follows:
P(x) = a_n x^n + a_{n-1} x^{n-1} + ··· + a_1 x + a_0 = (x − x0) Q(x) + b_0
INPUT degree n; coefficients a_0, a_1, ..., a_n; x0.
OUTPUT y = P(x0); z = P'(x0).
Step 1 Set y = a_n; (Compute b_n for P.)
           z = a_n. (Compute b_{n-1} for Q.)
Step 2 For j = n − 1, n − 2, ..., 1
           set y = x0 * y + a_j; (Compute b_j for P.)
               z = x0 * z + y. (Compute b_{j-1} for Q.)
Step 3 Set y = x0 * y + a_0.
Step 4 OUTPUT (y, z);
       STOP.
Now the issue I have here is that the subscripts in the pseudo code do not seem to match the subscripts in the formula.
I did a Python implementation, but the answers I got seemed wrong, so I changed it slightly:
def horner(x0, *a):
'''
Horner's method is an algorithm to calculate a polynomial at
f(x0) and f'(x0)
x0 - The value to avaluate
a - An array of the coefficients
The degree is the polynomial is set equal to the number of coefficients
'''
n = len(a)
y = a[0]
z = a[0]
for j in range(1, n):
y = x0 * y + a[j]
z = x0 * z + y
y = x0 * y + a[-1]
print('P(x0) =', y)
print('P\'(x0) =', z)
It works pretty well, but I need someone with more experience in this regard to give it a once over.
As a very basic test I took the polynomial 2x^4 with a value of -2 for x.
To use it in the method I call it as such
horner(-2, 2, 0, 0 ,0)
The output looks plausible for this very simple problem, but this is where I am confused: for f(x) = 2x^4 I expected f(-2) = -32, and my implementation gives a positive result.
I found the problem and I am adding the answer here since it may prove useful to someone in the future
Firstly, there is a bug in the algorithm: the loop must only go up to n-1.
def horner(x0, *a):
    '''
    Horner's method is an algorithm to calculate a polynomial at
    f(x0) and f'(x0)
    x0 - The value to evaluate at
    a - An array of the coefficients, highest degree first
    The degree of the polynomial is one less than the number of coefficients
    '''
    n = len(a)
    y = a[0]
    z = a[0]
    for j in range(1, n - 1):
        y = x0 * y + a[j]
        z = x0 * z + y
    y = x0 * y + a[-1]
    print('P(x0) =', y)
    print('P\'(x0) =', z)
Second, and this was my biggest mistake, I am not passing enough coefficients to the method
This will calculate 2x^3
horner(-2, 2, 0, 0, 0)
I actually had to call
horner(-2, 2, 0, 0, 0, 0)
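For testing it is often more convenient to return the two values instead of printing them. The following is a sketch of that variant, not the textbook's exact formulation; the special case for a constant polynomial is an addition made here.

```python
def horner(x0, *a):
    """Return (P(x0), P'(x0)) for coefficients a given highest degree first."""
    if len(a) == 1:
        return a[0], 0            # constant polynomial: derivative is zero
    y = a[0]
    z = a[0]
    for coeff in a[1:-1]:         # every coefficient except the first and last
        y = x0 * y + coeff
        z = x0 * z + y
    y = x0 * y + a[-1]            # the final step only updates P, not P'
    return y, z


print(horner(-2, 2, 0, 0, 0))     # 2x^3 at x = -2: (-16, 24)
print(horner(-2, 2, 0, 0, 0, 0))  # 2x^4 at x = -2: (32, -64)
```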
I'm trying to plot a piecewise fit to my data, but I need to do it with an arbitrary number of line segments. Sometimes there are three segments; sometimes there are two. I'm storing the coefficients of the fit in actable and the bounds on the segments in btable.
Here are example values of my bounds:
btable = [[0.00499999989, 0.0244274978], [0.0244275965, 0.0599999987]]
Here are example values of my coefficients:
actable = [[0.0108687987, -0.673182865, 14.6420775], [0.00410866373, -0.0588355861, 1.07750032]]
Here's what my code looks like:
rfig = plt.figure()
<>various other plot specifications<>
x = np.arange(0.005, 0.06, 0.0001)
y = np.piecewise(x, [(x >= btable[i][0]) & (x <= btable[i][1]) for i in range(len(btable))], [lambda x=x: np.log10(actable[j][0] + actable[j][1] * x + actable[j][2] * x**2) for j in list(range(len(actable)))])
plt.plot(x, y)
The problem is that lambda sets itself to the last instance of the list, so it uses the coefficients for the last segment for all the segments. I don't know how to do a piecewise function without using lambda.
Currently, I'm cheating by doing this:
if len(btable) == 2:
    y = np.piecewise(x, [(x >= btable[i][0]) & (x <= btable[i][1]) for i in range(len(btable))], [lambda x: np.log10(actable[0][0] + actable[0][1] * x + actable[0][2] * x**2), lambda x: np.log10(actable[1][0] + actable[1][1] * x + actable[1][2] * x**2)])
elif len(btable) == 3:
    y = np.piecewise(x, [(x >= btable[i][0]) & (x <= btable[i][1]) for i in range(len(btable))], [lambda x: np.log10(actable[0][0] + actable[0][1] * x + actable[0][2] * x**2), lambda x: np.log10(actable[1][0] + actable[1][1] * x + actable[1][2] * x**2), lambda x: np.log10(actable[2][0] + actable[2][1] * x + actable[2][2] * x**2)])
else:
    print('Oh no! You have fewer than 2 or more than 3 segments!')
But this makes me feel icky on the inside. I know there must be a better solution. Can someone help?
This issue is common enough that Python's official documentation has an article Why do lambdas defined in a loop with different values all return the same result? with a suggested solution: create a local variable to be initialized by the loop variable, to capture the changing values of the latter within the function.
That is, in the definition of y it suffices to replace
[lambda x=x: np.log10(actable[j][0] + actable[j][1] * x + actable[j][2] * x**2) for j in range(len(actable))]
by
[lambda x=x, k=j: np.log10(actable[k][0] + actable[k][1] * x + actable[k][2] * x**2) for j in range(len(actable))]
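The effect is easy to reproduce without NumPy. In this small sketch (the names late and bound are made up for the illustration), the first list looks up j when the lambdas are called, the second freezes its value through a default argument:

```python
# j is looked up when the lambda is called, i.e. after the loop has finished
late = [lambda x: j * x for j in range(3)]

# k=j is evaluated when the lambda is created, so each one keeps its own value
bound = [lambda x, k=j: k * x for j in range(3)]

print([f(10) for f in late])   # [20, 20, 20]
print([f(10) for f in bound])  # [0, 10, 20]
```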
By the way, one can use one-sided inequalities to specify ranges for numpy.piecewise: the last of the conditions that evaluate to True will trigger the corresponding function. (This is a somewhat counterintuitive priority; using the first true condition would be more natural, like SymPy does). If the breakpoints are arranged in increasing order, then one should use "x>=" inequalities:
breaks = np.arange(0, 10) # breakpoints
coeff = np.arange(0, 20, 2) # coefficients to use
x = np.arange(0, 10, 0.1)
y = np.piecewise(x, [x >= b for b in breaks], [lambda x=x, a=c: a*x for c in coeff])
Here each coefficient will be used for the interval that begins with the corresponding breakpoint; e.g., coefficient c=0 is used in the range 0<=x<1, coefficient c=2 in the range 1<=x<2, and so on.
I've written some Python code to do some image processing work, but it takes a huge amount of time to run. I've spent the last few hours trying to optimize it, but I think I've reached the end of my abilities.
Looking at the outputs from the profiler, the function below is taking a large proportion of the overall time of my code. Is there any way that it can be speeded up?
def make_ellipse(x, x0, y, y0, theta, a, b):
    c = np.cos(theta)
    s = np.sin(theta)
    a2 = a**2
    b2 = b**2
    xnew = x - x0
    ynew = y - y0
    ellipse = (xnew * c + ynew * s)**2/a2 + (xnew * s - ynew * c)**2/b2 <= 1
    return ellipse
To give the context, it is called with x and y as the output from np.meshgrid with a fairly large grid size, and all of the other parameters as simple integer values.
Although that function seems to be taking a lot of the time, there are probably ways that the rest of the code can be speeded up too. I've put the rest of the code at this gist.
Any ideas would be gratefully received. I've tried using numba and autojiting the main functions, but that doesn't help much.
Let's try to optimize make_ellipse in conjunction with its caller.
First, notice that a and b are the same over many calls. Since make_ellipse squares them each time, just have the caller do that instead.
Second, notice that np.cos(np.arctan(theta)) is 1 / np.sqrt(1 + theta**2) which seems slightly faster on my system. A similar trick can be used to compute the sine, either from theta or from cos(theta) (or vice versa).
Third, and less concretely, think about short-circuiting some of the final ellipse formula evaluations. For example, wherever (xnew * c + ynew * s)**2/a2 is greater than 1, the ellipse value must be False. If this happens often, you can "mask" out the second half of the (expensive) calculation of the ellipse at those locations. I haven't planned this thoroughly, but see numpy.ma for some possible leads.
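The identity behind the second point can be checked with the standard library alone. This is a sketch with a made-up helper name; it only illustrates the algebra and says nothing about how much time it saves.

```python
import math


def cos_sin_of_arctan(t):
    """Return (cos(arctan(t)), sin(arctan(t))) without calling arctan."""
    c = 1.0 / math.sqrt(1.0 + t * t)
    s = t * c                      # sin = tan * cos
    return c, s


for t in (-3.0, -0.5, 0.0, 0.25, 7.0):
    c, s = cos_sin_of_arctan(t)
    print(t, c - math.cos(math.atan(t)), s - math.sin(math.atan(t)))
```

The printed differences should be zero up to floating-point rounding.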
It won't speed things up in all cases, but if your ellipses don't take up the whole image, you should limit your search for points inside the ellipse to its bounding rectangle. I am lazy with the math, so I googled it and reused @JohnZwinck's neat cosine-of-an-arctangent trick to come up with this function:
def ellipse_bounding_box(x0, y0, theta, a, b):
    x_tan_t = -b * np.tan(theta) / a
    if np.isinf(x_tan_t):
        x_cos_t = 0
        x_sin_t = np.sign(x_tan_t)
    else:
        x_cos_t = 1 / np.sqrt(1 + x_tan_t*x_tan_t)
        x_sin_t = x_tan_t * x_cos_t
    x = x0 + a*x_cos_t*np.cos(theta) - b*x_sin_t*np.sin(theta)
    y_tan_t = b / np.tan(theta) / a
    if np.isinf(y_tan_t):
        y_cos_t = 0
        y_sin_t = np.sign(y_tan_t)
    else:
        y_cos_t = 1 / np.sqrt(1 + y_tan_t*y_tan_t)
        y_sin_t = y_tan_t * y_cos_t
    y = y0 + b*y_sin_t*np.cos(theta) + a*y_cos_t*np.sin(theta)
    return np.sort([-x, x]), np.sort([-y, y])
You can now modify your original function to something like this:
def make_ellipse(x, x0, y, y0, theta, a, b):
    c = np.cos(theta)
    s = np.sin(theta)
    a2 = a**2
    b2 = b**2
    x_box, y_box = ellipse_bounding_box(x0, y0, theta, a, b)
    indices = ((x >= x_box[0]) & (x <= x_box[1]) &
               (y >= y_box[0]) & (y <= y_box[1]))
    xnew = x[indices] - x0
    ynew = y[indices] - y0
    # the np.bool alias was removed in NumPy 1.24; the builtin bool works everywhere
    ellipse = np.zeros_like(x, dtype=bool)
    ellipse[indices] = ((xnew * c + ynew * s)**2/a2 +
                        (xnew * s - ynew * c)**2/b2 <= 1)
    return ellipse
Since everything but x and y are integers, you can try to minimize the number of array computations. I imagine most of the time is spent in this statement:
ellipse = (xnew * c + ynew * s)**2/a2 + (xnew * s - ynew * c)**2/b2 <= 1
A simple rewriting like so should reduce the number of array operations:
a = float(a)
b = float(b)
ellipse = (xnew * (c/a) + ynew * (s/a))**2 + (xnew * (s/b) - ynew * (c/b))**2 <= 1
What was 12 array operations is now 10 (plus 4 scalar ops). I'm not sure if numba's jit would have tried this. It might just do all the broadcasting first, then jit the resulting operations. In this case, reordering so common operations are done at once should help.
Furthering along, you can rewrite this again as
ellipse = ((xnew + ynew * (s/c)) * (c/a))**2 + ((xnew * (s/c) - ynew) * (c/b))**2 <= 1
Or
t = np.tan(theta)
ellipse = ((xnew + ynew * t) * (b/a))**2 + (xnew * t - ynew)**2 <= (b/c)**2
Replacing one more array operation with a scalar, and eliminating other scalar ops to get 9 array operations and 2 scalar ops.
As always, be aware of what the range of inputs are to avoid rounding errors.
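A quick scalar check that the last rewriting agrees with the original test can be done with the standard library. This is a sketch with made-up function names; x and y stand for the already centred coordinates xnew and ynew, and it assumes cos(theta) is not zero, since the rewritten form divides by it.

```python
import math


def inside_original(x, y, theta, a, b):
    c, s = math.cos(theta), math.sin(theta)
    return (x * c + y * s)**2 / a**2 + (x * s - y * c)**2 / b**2 <= 1


def inside_rewritten(x, y, theta, a, b):
    c, t = math.cos(theta), math.tan(theta)
    # same inequality multiplied through by (b / c)**2
    return ((x + y * t) * (b / a))**2 + (x * t - y)**2 <= (b / c)**2


theta, a, b = 0.3, 5, 3
mismatches = sum(
    inside_original(x, y, theta, a, b) != inside_rewritten(x, y, theta, a, b)
    for x in range(-10, 11)
    for y in range(-10, 11)
)
print(mismatches)
```

Points that lie almost exactly on the boundary can still be classified differently by the two forms because of rounding, which is the caveat mentioned above.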
Unfortunately, there's no good way to do a running sum and bail out early if either of the two addends is greater than the right-hand side of the comparison. That would be an obvious speed-up, but one you'd need Cython (or C/C++) to code.
You can speed it up considerably by using Cython. There is a very good documentation on how to do this.