So I was bored and decided to come up with a method to calculate pi. I implemented it, and it ran well. I wanted to optimize it, so I ran the profiler; the run took about 26 seconds. The profile showed the abs() function accounting for a lot of that time, so I found a way to avoid abs(). After that, the run took 8 seconds! Can someone explain why the abs() function was taking so long?
Here is the code without abs():
def picalc(radius = 10000000):
    total = 0
    x = 0
    y = radius
    for i in range(radius + 1):
        x1 = i
        y1 = (radius ** 2 - x1 ** 2) ** 0.5
        total += ((x1 - x) ** 2 + (y1 - y) ** 2) ** 0.5
        x = x1
        y = y1
    print(total / (radius / 2))

import profile
profile.run('picalc()')
If I change the line total += ((x1 - x) ** 2 + (y1 - y) ** 2) ** 0.5 to total += (abs(x1 - x) ** 2 + abs(y1 - y) ** 2) ** 0.5, the operation runs MUCH slower.
EDIT: I know that the negatives cancel when squaring. That was a mistake I made.
EDIT x2: I tried substituting total += ((x1 - x) ** 2 + (y1 - y) ** 2) ** 0.5 with total += math.hypot(x1 - x, y1 - y), but the profiler tells me it took 10 seconds longer! I read the docs, and they say the math module contains thin wrappers around the C math library (at least in CPython). How can C be slower than Python in this case?
First of all: the abs() calls are entirely redundant if you are squaring the result anyway.
Next, you may be reading the profile output wrong; don't confuse the cumulative time with the time spent in the function call itself. You are calling abs() many, many times, so the accumulated time rises quickly.
Moreover, profiling adds a lot of overhead to the interpreter. Use the timeit module to compare approaches; it gives you overall performance metrics so you can compare apples with apples.
It is not that the abs() function is slow; it is that calling any function is 'slow'. Looking up the global name is slower than looking up a local, and then a new frame has to be pushed onto the stack, the function executed, and the frame popped off again.
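If you want to convince yourself that the cost is the call itself rather than anything abs() computes, time an equally trivial user-defined function next to it; a rough sketch (no numbers quoted, since they vary per machine):
from timeit import timeit

def identity(x):
    return x  # does even less work than abs()

print(timeit("abs(-1.5)", number=10_000_000))
print(timeit("identity(-1.5)", "from __main__ import identity", number=10_000_000))
print(timeit("x ** 2", "x = -1.5", number=10_000_000))  # no call at all
The two function-call versions should land close together, with the call-free expression noticeably cheaper.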
You can alleviate one of those pain points by making abs() a local name outside the loop:
_abs = abs
for i in range(radius + 1):
    # ...
    total += (_abs(x1 - x) ** 2 + _abs(y1 - y) ** 2) ** 0.5
Not that abs() really takes such a huge toll on your performance, though, not when you time your functions properly. Using a radius of 1000 to make 100 repeats practical, timeit comparisons give me:
>>> from timeit import timeit
>>> def picalc(radius = 10000000):
...     total = 0
...     x = 0
...     y = radius
...     for i in range(radius + 1):
...         x1 = i
...         y1 = (radius ** 2 - x1 ** 2) ** 0.5
...         total += ((x1 - x) ** 2 + (y1 - y) ** 2) ** 0.5
...         x = x1
...         y = y1
...
>>> def picalc_abs(radius = 10000000):
...     total = 0
...     x = 0
...     y = radius
...     for i in range(radius + 1):
...         x1 = i
...         y1 = (radius ** 2 - x1 ** 2) ** 0.5
...         total += (abs(x1 - x) ** 2 + abs(y1 - y) ** 2) ** 0.5
...         x = x1
...         y = y1
...
>>> def picalc_abs_local(radius = 10000000):
...     total = 0
...     x = 0
...     y = radius
...     _abs = abs
...     for i in range(radius + 1):
...         x1 = i
...         y1 = (radius ** 2 - x1 ** 2) ** 0.5
...         total += (_abs(x1 - x) ** 2 + _abs(y1 - y) ** 2) ** 0.5
...         x = x1
...         y = y1
...
>>> timeit('picalc(1000)', 'from __main__ import picalc', number=100)
0.13862298399908468
>>> timeit('picalc(1000)', 'from __main__ import picalc_abs as picalc', number=100)
0.14540845900774002
>>> timeit('picalc(1000)', 'from __main__ import picalc_abs_local as picalc', number=100)
0.13702849800756667
Notice how there is very little difference between the three approaches now.
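If you want to test the math.hypot variant from your second edit the same way, just add it to the comparison; a sketch (no numbers quoted here, as they vary per machine and Python version):
import math
from timeit import timeit

def picalc_hypot(radius = 10000000):
    total = 0
    x = 0
    y = radius
    for i in range(radius + 1):
        x1 = i
        y1 = (radius ** 2 - x1 ** 2) ** 0.5
        total += math.hypot(x1 - x, y1 - y)  # one call per iteration
        x = x1
        y = y1

print(timeit('picalc(1000)', 'from __main__ import picalc_hypot as picalc', number=100))
Expect a similar story: what you pay for is the extra name lookup and function call per iteration, not the underlying C routine.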
Related
I'm writing a simple integration approximation function. The thing is that no matter what I do in my code, my while implementation always beats my for implementation, which is very weird, since for should be faster: it does not have to check a boolean expression and increment a variable on each iteration.
The code:
import math
import time

import numpy as np

def function(x):
    return math.cos(x)

def integrate_while(func, x_start, x_end, n_steps=10_000):
    step_length = (x_end - x_start) / n_steps
    area = 0
    x = x_start
    x_end -= step_length
    while x < x_end:
        area += func(x) + func(x + step_length)
        x += step_length
    area *= step_length / 2
    return area

def integrate_for(func, x_start, x_end, n_steps=10_000):
    step_length = (x_end - x_start) / n_steps
    area = 0
    ls = np.linspace(x_start, x_end, n_steps)
    for x in ls:
        area += func(x) + func(x + step_length)
    area *= step_length / 2
    return area

def integrate_for_range(func, x_start, x_end, n_steps=10_000):
    step_length = (x_end - x_start) / n_steps
    area = 0
    for i in range(n_steps):
        x = x_start + i * step_length
        area += func(x) + func(x + step_length)
    area *= step_length / 2
    return area

def test():
    integrate_funcs = [integrate_while, integrate_for, integrate_for_range]
    for integrate_func in integrate_funcs:
        t1 = time.time_ns()
        result = integrate_func(function, 0, math.pi / 2, n_steps=1_000_000)
        t2 = time.time_ns()
        print(f'Function {integrate_func.__name__}. Result: {round(result, 4)}, Elapsed ns: {t2-t1:,}.')

test()
Results:
Function integrate_while. Result: 1.0, Elapsed ns: 569,587,400.
Function integrate_for. Result: 1.0, Elapsed ns: 638,829,800.
Function integrate_for_range. Result: 1.0, Elapsed ns: 596,499,300.
Edit: I already checked the impact of the np.linspace object creation on the total execution time of the for loop, and it is less than 1% of the total.
The original np.linspace was the problem, since NumPy arrays are not meant to be iterated over element by element; there is significant overhead when you do so.
After replacing np.linspace with the native range type, I could outperform the simple while loop.
Some extra changes improved performance further:
Setting variable types before entering the for loop (10% less time).
Changing the loop logic to avoid extra function calls (50% less time).
Final function code:
def integrate_for_range_opt(func, x_start, x_end, n_steps=10_000):
    step_length = (x_end - x_start) / n_steps
    area = 0.0
    x = float(x_start) + step_length
    for i in range(n_steps - 1):
        area = area + func(x)
        x = x + step_length
    area = area * 2.0 + func(x_start) + func(x_end)
    return area * step_length / 2.0
Your post doesn't actually ask a question; to me it seemed to want an explanation, but your comments and your own answer give me the impression that what you really want is a fast solution. So here's a faster one, making Python's built-in functions do more of the work for you:
from itertools import accumulate, repeat

def integrate_map_accumulate(func, x_start, x_end, n_steps=10_000):
    step_length = (x_end - x_start) / n_steps
    xs = accumulate(repeat(step_length, n_steps - 2),
                    initial = x_start + step_length)
    area = sum(map(func, xs))
    area = area * 2.0 + func(x_start) + func(x_end)
    return area * step_length / 2.0
In my testing with your benchmark code, that reduced the time from ~0.21 seconds (for your answer's solution) to ~0.14 seconds.
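For completeness: if the integrand can be evaluated on whole arrays (np.cos rather than math.cos), a fully vectorized NumPy version removes the Python-level loop entirely. A sketch under that assumption (the function name is mine):
import numpy as np

def integrate_vectorized(func, x_start, x_end, n_steps=10_000):
    # func must accept NumPy arrays, e.g. np.cos instead of math.cos
    step_length = (x_end - x_start) / n_steps
    xs = x_start + step_length * np.arange(1, n_steps)  # interior points
    area = func(xs).sum() * 2.0 + func(x_start) + func(x_end)
    return area * step_length / 2.0
Calling integrate_vectorized(np.cos, 0, np.pi / 2, n_steps=1_000_000) returns approximately 1.0, like the other versions.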
I will try to explain exactly what's going on and what my issue is.
This is a bit mathy and SO doesn't support latex, so sadly I had to resort to images. I hope that's okay.
At any rate, this is a linear system Ax = b where we know A and b, so we can find x, which is our approximation at the next time step. We continue doing this until time t_final.
This is the code
import numpy as np

tau = 2 * np.pi
tau2 = tau * tau
i = complex(0, 1)

def solution_f(t, x):
    return 0.5 * (np.exp(-tau * i * x) * np.exp((2 - tau2) * i * t) + np.exp(tau * i * x) * np.exp((tau2 + 4) * i * t))

def solution_g(t, x):
    return 0.5 * (np.exp(-tau * i * x) * np.exp((2 - tau2) * i * t) - np.exp(tau * i * x) * np.exp((tau2 + 4) * i * t))

for l in range(2, 12):
    N = 2 ** l        # number of grid points
    dx = 1.0 / N      # space between grid points
    dx2 = dx * dx
    dt = dx           # time step
    t_final = 1

    approximate_f = np.zeros((N, 1), dtype=complex)
    approximate_g = np.zeros((N, 1), dtype=complex)

    # Insert initial conditions
    for k in range(N):
        approximate_f[k, 0] = np.cos(tau * k * dx)
        approximate_g[k, 0] = -i * np.sin(tau * k * dx)

    # Create coefficient matrix
    A = np.zeros((2 * N, 2 * N), dtype=complex)

    # First row is special
    A[0, 0] = 1 - 3 * i * dt
    A[0, N] = ((2 * dt / dx2) + dt) * i
    A[0, N + 1] = (-dt / dx2) * i
    A[0, -1] = (-dt / dx2) * i

    # Last row is special
    A[N - 1, N - 1] = 1 - (3 * dt) * i
    A[N - 1, N] = (-dt / dx2) * i
    A[N - 1, -2] = (-dt / dx2) * i
    A[N - 1, -1] = ((2 * dt / dx2) + dt) * i

    # Middle rows
    for k in range(1, N - 1):
        A[k, k] = 1 - (3 * dt) * i
        A[k, k + N - 1] = (-dt / dx2) * i
        A[k, k + N] = ((2 * dt / dx2) + dt) * i
        A[k, k + N + 1] = (-dt / dx2) * i

    # Bottom half
    A[N:, :N] = A[:N, N:]
    A[N:, N:] = A[:N, :N]

    Ainv = np.linalg.inv(A)

    # Advance through time
    time = 0
    while time < t_final:
        b = np.concatenate((approximate_f, approximate_g), axis=0)
        x = np.dot(Ainv, b)  # Solve Ax = b
        approximate_f = x[:N]
        approximate_g = x[N:]
        time += dt
    approximate_solution = np.concatenate((approximate_f, approximate_g), axis=0)

    # Calculate the actual solution
    actual_f = np.zeros((N, 1), dtype=complex)
    actual_g = np.zeros((N, 1), dtype=complex)
    for k in range(N):
        actual_f[k, 0] = solution_f(t_final, k * dx)
        actual_g[k, 0] = solution_g(t_final, k * dx)
    actual_solution = np.concatenate((actual_f, actual_g), axis=0)

    print(np.sqrt(dx) * np.linalg.norm(actual_solution - approximate_solution))
It doesn't work. At least not in the beginning; it shouldn't start out this slow. The method should be unconditionally stable and converge to the right answer.
What's going wrong here?
The L2 norm can be a useful metric to test convergence, but it isn't ideal for debugging, as it doesn't explain what the problem is. Although your solution should be unconditionally stable, backward Euler won't necessarily converge to the right answer. Just as forward Euler is notoriously unstable (anti-dissipative), backward Euler is notoriously dissipative. Plotting your solutions confirms this: the numerical solutions converge to zero. For a higher-order approximation, Crank-Nicolson is a reasonable candidate. The code below contains the more general theta method so that you can tune the implicitness of the solution: theta=0.5 gives CN, theta=1 gives BE, and theta=0 gives FE.
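Concretely, with H holding the dt-scaled spatial terms, one step of the theta method solves (I + theta*H) u_new = (I - (1 - theta)*H) u_old. A minimal sketch of that update (my own helper, separate from the full script below):
import numpy as np

def theta_step(H, u, theta=0.5):
    """One theta-method step: solve (I + theta*H) u_new = (I - (1-theta)*H) u_old.
    theta=0.5 -> Crank-Nicolson, 1.0 -> backward Euler, 0.0 -> forward Euler."""
    n = H.shape[0]
    A = np.eye(n) + theta * H
    B = np.eye(n) - (1.0 - theta) * H
    return np.linalg.solve(A, B @ u)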
A couple of other things that I tweaked:
I selected a more appropriate time step of dt = (dx**2)/2 instead of dt = dx. The latter doesn't converge to the right solution using CN.
It's a minor note, but since t_final isn't guaranteed to be a multiple of dt, you weren't comparing solutions at the same time step.
With regards to your comment about it being slow: as you increase the spatial resolution, your time resolution needs to increase too. Even in your case with dt=dx, you have to perform a (1024 x 1024)*1024 matrix multiplication 1024 times. I didn't find this to take particularly long on my machine. I removed some unneeded concatenation to speed it up a bit, but changing the time step to dt = (dx**2)/2 will really bog things down, unfortunately. You could try compiling with Numba if you are concerned with speed.
All that said, I didn't find tremendous success with the consistency of CN. I had to set N=2^6 to get anything at t_final=1. Increasing t_final makes this worse; decreasing t_final makes it better. Depending on your needs, you could look into implementing TR-BDF2 or other linear multistep methods to improve this.
The code with a plot is below:
import numpy as np
import matplotlib.pyplot as plt

tau = 2 * np.pi
tau2 = tau * tau
i = complex(0, 1)

def solution_f(t, x):
    return 0.5 * (np.exp(-tau * i * x) * np.exp((2 - tau2) * i * t) + np.exp(tau * i * x) * np.exp((tau2 + 4) * i * t))

def solution_g(t, x):
    return 0.5 * (np.exp(-tau * i * x) * np.exp((2 - tau2) * i * t) - np.exp(tau * i * x) * np.exp((tau2 + 4) * i * t))

l = 6
N = 2 ** l
dx = 1.0 / N
dx2 = dx * dx
dt = dx2 / 2
t_final = 1.
x_arr = np.arange(0, 1, dx)

approximate_f = np.cos(tau * x_arr)
approximate_g = -i * np.sin(tau * x_arr)

H = np.zeros([2 * N, 2 * N], dtype=complex)
for k in range(N):
    H[k, k] = -3 * i * dt
    H[k, k + N] = (2 / dx2 + 1) * i * dt
    if k == 0:
        H[k, N + 1] = -i / dx2 * dt
        H[k, -1] = -i / dx2 * dt
    elif k == N - 1:
        H[N - 1, N] = -i / dx2 * dt
        H[N - 1, -2] = -i / dx2 * dt
    else:
        H[k, k + N - 1] = -i / dx2 * dt
        H[k, k + N + 1] = -i / dx2 * dt

### Bottom half
H[N:, :N] = H[:N, N:]
H[N:, N:] = H[:N, :N]

### Theta method. 0.5 -> Crank-Nicolson
theta = 0.5
A = np.eye(2 * N) + H * theta
B = np.eye(2 * N) - H * (1 - theta)

### Precompute for faster computations
mat = np.linalg.inv(A) @ B

t = 0
b = np.concatenate((approximate_f, approximate_g))
while t < t_final:
    t += dt
    b = mat @ b
approximate_f = b[:N]
approximate_g = b[N:]
approximate_solution = np.concatenate((approximate_f, approximate_g))

# Calculate the actual solution
actual_f = solution_f(t, np.arange(0, 1, dx))
actual_g = solution_g(t, np.arange(0, 1, dx))
actual_solution = np.concatenate((actual_f, actual_g))

plt.figure(figsize=(7, 5))
plt.plot(x_arr, actual_f.real, c="C0", label=r"$Re(f_\mathrm{true})$")
plt.plot(x_arr, actual_f.imag, c="C1", label=r"$Im(f_\mathrm{true})$")
plt.plot(x_arr, approximate_f.real, c="C0", ls="--", label=r"$Re(f_\mathrm{num})$")
plt.plot(x_arr, approximate_f.imag, c="C1", ls="--", label=r"$Im(f_\mathrm{num})$")
plt.legend(loc=3, fontsize=12)
plt.xlabel("x")
plt.savefig("num_approx.png", dpi=150)
I am not going to go through all of your math, but I'm going to offer a suggestion.
The use of a direct calculation for fxx and gxx seems like a good candidate for being numerically unstable. Intuitively a first order method should be expected to make second order mistakes in the terms. Second order mistakes in the individual terms, after passing through that formula, wind up as constant order mistakes in the second derivative. Plus when your step size gets small, you are going to find that a quadratic formula makes even small roundoff mistakes turn into surprisingly large errors.
Instead, I would suggest that you start by turning this into a first-order system of four functions, f, fx, g, and gx, and then proceed with backward Euler on that system. Intuitively, with this approach, a first-order method creates second-order mistakes, which pass through a formula that creates first-order mistakes from them. And now you are converging as you should from the start, and are also not as sensitive to propagation of roundoff errors.
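The time stepping itself then follows the usual implicit pattern; here is a minimal backward-Euler sketch for a generic linear first-order system u' = M u (building M for the suggested [f, fx, g, gx] variables is problem-specific and not shown):
import numpy as np

def backward_euler(M, u0, dt, n_steps):
    """Repeatedly solve (I - dt*M) u_next = u, starting from u0."""
    n = M.shape[0]
    A = np.eye(n) - dt * M
    u = u0
    for _ in range(n_steps):
        u = np.linalg.solve(A, u)  # factor A once (e.g. an LU decomposition) if speed matters
    return u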
I have a fairly simple function involving a logarithm of base 10 (f1 shown below). I need it to run as fast as possible since it is called millions of times as part of a larger code.
I tried with a Taylor approximation (f2 below) but even with a large expansion the accuracy is very poor and, even worse, it ends up taking a lot more time.
Have I reached the limit of performance attainable with numpy?
import time

import numpy as np

def f1(m1, m2):
    return m1 - 2.5 * np.log10(1. + 10 ** (-.4 * (m2 - m1)))

def f2(m1, m2):
    """
    Taylor expansion of 'f1'.
    """
    x = -.4 * (m2 - m1)
    return m1 - 2.5 * (
        0.30102999 + .5 * x + 0.2878231366 * x ** 2 -
        0.0635837 * x ** 4 + 0.0224742887 * x ** 6 -
        0.00904311879 * x ** 8 + 0.00388579 * x ** 10)

# The data I actually use has more or less this range.
N = 1000
m1 = np.random.uniform(5., 30., N)
m2 = np.random.uniform(.7 * m1, m1)

# Test both functions
M = 5000

s = time.perf_counter()  # perf_counter() replaces the removed time.clock()
for _ in range(M):
    mc1 = f1(m1, m2)
t1 = time.perf_counter() - s

s = time.perf_counter()
for _ in range(M):
    mc2 = f2(m1, m2)
t2 = time.perf_counter() - s

print(t1, t2, np.allclose(mc1, mc2, 0.01))
With this code snippet, I'm not sure you should optimize the log itself so much as the whole vector expression.
You can try numexpr (a fast numerical array expression evaluator for Python and NumPy), which might do a lot for you.
The idea to try this came from Ignacio's comment, which made me think about where his speedup was coming from (I'm sure it's not coming from the log calculation itself).
In my simple modification of your code:
import numexpr as ne

def f1(m1, m2):
    return ne.evaluate("m1 - 2.5 * log10( 1.0 + 10 ** (-0.4 * (m2-m1)))")
It seems the above is 5-6x as fast as (an unoptimized) f2 (the approximation), while still giving the original accuracy.
It's also nearly twice as fast as the original NumPy approach f1.
These numbers might change depending on numexpr's setup, as Intel's MKL, for example, could be used too. As I'm too lazy to check my Anaconda-based setup, I offer this just as a tech demo, which everyone can try out too.
While I have used numexpr a few times in the past for simple stuff, I might add that it's also used within pandas, just to mention a real-world project depending on its correct workings.
Disclaimer: I used your benchmark as a template (and hope caching and the like does not play a role).
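One practical note if you try it yourself: numexpr's thread count affects these timings quite a bit. As far as I know you can inspect and change it like this (worth double-checking against the numexpr docs for your version):
import numexpr as ne

print(ne.detect_number_of_cores())  # how many cores numexpr detected
old = ne.set_num_threads(4)         # use 4 threads; returns the previous setting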
Replace all of those exponentiations in f2 with multiplication:
def f2(m1, m2):
    """
    Taylor expansion of 'f1'.
    """
    x = -0.4 * (m2 - m1)
    x2 = x * x
    x4 = x2 * x2
    x6 = x4 * x2
    return m1 - 2.5 * (
        0.30102999 + .5 * x + 0.2878231366 * x2 -
        0.0635837 * x4 + 0.0224742887 * x6 -
        0.00904311879 * x4 * x4 + 0.00388579 * x4 * x6)
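If you prefer, the same polynomial can also be written in Horner form in x2 = x * x, regrouping the coefficients above rather than naming the powers; a sketch of that variant:
def f2_horner(m1, m2):
    """Same Taylor polynomial as f2, evaluated in Horner form."""
    x = -0.4 * (m2 - m1)
    x2 = x * x
    poly = 0.30102999 + 0.5 * x + x2 * (0.2878231366 + x2 * (-0.0635837 +
        x2 * (0.0224742887 + x2 * (-0.00904311879 + 0.00388579 * x2))))
    return m1 - 2.5 * poly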
Consider the following equation system:
1623.66790917 * x ** 2 + 468.829686367 * x * y + 252.762128419 * y ** 2 + -1027209.42116 * x + -301192.975791 * y + 188804356.212 = 0
11154.1759415 * x ** 2 + 31741.0229155 * x * y + 32933.5622632 * y ** 2 + -16226174.4037 * x + -26323622.7497 * y + 6038609721.67 = 0
As you can see, there are two pairs of complex solutions to the system. I tried sympy, but without success. I want to know how to work this out in Python. BTW, I don't have a nice initial guess for numeric methods.
You can solve those equations numerically using mpmath's findroot(). As far as I know there isn't a way to tell findroot() to find multiple roots, but we can get around that restriction: first, find a solution pair (xa, ya), then divide the equations by (x - xa)*(y - ya). You do have to supply an initial approximate solution, but I managed to find something that worked in only a few tries.
from mpmath import mp

mp.dps = 30
prec = 20

f_1 = lambda x, y: 1623.66790917 * x ** 2 + 468.829686367 * x * y + 252.762128419 * y ** 2 + -1027209.42116 * x + -301192.975791 * y + 188804356.212
f_2 = lambda x, y: 11154.1759415 * x ** 2 + 31741.0229155 * x * y + 32933.5622632 * y ** 2 + -16226174.4037 * x + -26323622.7497 * y + 6038609721.67

def solve(f1, f2, initxy):
    return mp.findroot([f1, f2], initxy, solver='muller')

def show(x, y):
    print('x=', mp.nstr(x, prec))
    print('y=', mp.nstr(y, prec))
    print(mp.nstr(f_1(x, y), prec))
    print(mp.nstr(f_2(x, y), prec))
    print()

f1a = f_1
f2a = f_2
xa, ya = solve(f1a, f2a, (240+40j, 265-85j))
show(xa, ya)

f1b = lambda x, y: f1a(x, y) / ((x - xa) * (y - ya))
f2b = lambda x, y: f2a(x, y) / ((x - xa) * (y - ya))
xb, yb = solve(f1b, f2b, (290+20j, 270+30j))
show(xb, yb)
output
x= (246.82064795986653023 + 42.076841530787279711j)
y= (261.83565021239842638 - 81.555049135736951496j)
(0.0 + 3.3087224502121106995e-24j)
(0.0 + 0.0j)
x= (289.31873055121622967 + 20.548128321524345062j)
y= (272.23440694481666637 + 29.381152413744722108j)
(0.0 + 3.3087224502121106995e-24j)
(0.0 + 0.0j)
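Because the coefficients of both equations are real, complex solutions come in conjugate pairs, so the remaining two solutions are just the conjugates of those above. A quick check, reusing show() (mpmath's complex values support .conjugate(), as far as I recall):
show(xa.conjugate(), ya.conjugate())
show(xb.conjugate(), yb.conjugate())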
Make a program to find the roots of a quadratic equation using the Bhaskara formula.
To calculate the square root, use: number ** 0.5
I can enter a, b, and c, but when I run the program it doesn't show the results for the roots x1 and x2.
This is my code so far:
a = int(input("a "))
b = int(input("b "))
c = int(input("c "))

delta = b * b - 4 * a * c
if delta >= 0:
    x1 = (-b + delta ** 0.5) / (2 * a)
    x2 = (-b - delta ** 0.5) / (2 * a)
    print("x1: ", x1)
    print("x2: ", x2)
Every quadratic with real coefficients a, b, and c has two roots (except when delta is 0), but sometimes the roots are complex numbers, not real numbers.
In particular, if I remember correctly:
If delta > 0, there are two real roots.
If delta == 0, there is only one root, the real number -b/(2*a).
If delta < 0, there are two complex roots (which are always conjugates).
If you do your math with complex numbers, you can use the same formula, (-b +/- delta**0.5) / (2*a), for all three cases, and you'll get two real numbers, the same real number twice, or two complex numbers, as appropriate.
There are also ways to calculate the real and imaginary parts of the third case without doing complex math, but since Python makes complex math easy, why bother unless you're specifically trying to learn about those ways?
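For reference, one of those "no complex math" ways for the delta < 0 case looks like this (a sketch, reusing a, b, and c from above):
delta = b * b - 4 * a * c
if delta < 0:
    real_part = -b / (2 * a)
    imag_part = (-delta) ** 0.5 / (2 * a)
    print("x1: ", real_part, "+", imag_part, "j")
    print("x2: ", real_part, "-", imag_part, "j")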
So, if you always want to print 2 roots, all you have to do is remove that if delta >= 0: line (and dedent the next few lines). Raising a negative number to the 0.5 power will give you a complex number automatically, and that will make the rest of the expression complex. Like this:
delta = b * b - 4 * a * c

x1 = (-b + delta ** 0.5) / (2 * a)
x2 = (-b - delta ** 0.5) / (2 * a)
print("x1: ", x1)
print("x2: ", x2)
If you only want 0-2 real roots, your code is already correct as-is. You might want to add a check for delta == 0, or just for x1 == x2, so you don't print the same value twice. Like this:
delta = b * b - 4 * a * c
if delta >= 0:
    x1 = (-b + delta ** 0.5) / (2 * a)
    x2 = (-b - delta ** 0.5) / (2 * a)
    print("x1: ", x1)
    if x1 != x2:
        print("x2: ", x2)
If you want some kind of error message, all you need to do is add an else clause. Something like this:
delta = b * b - 4 * a * c
if delta >= 0:
    x1 = (-b + delta ** 0.5) / (2 * a)
    x2 = (-b - delta ** 0.5) / (2 * a)
    print("x1: ", x1)
    print("x2: ", x2)
else:
    print("No real solutions because of negative delta:", delta)
Which one do you want? I have no idea. That's a question for you to answer. Once you decide what output you want for, say, 3, 4, and 5, you can pick the version that gives you that output.
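For example, taking a = 3, b = 4, c = 5 as inputs gives delta = -44, so the always-complex version prints two conjugate roots (roughly -0.67 + 1.11j and -0.67 - 1.11j), while the real-only versions print nothing or the error message. A quick check of the arithmetic:
a, b, c = 3, 4, 5
delta = b * b - 4 * a * c           # -44: no real roots
x1 = (-b + delta ** 0.5) / (2 * a)  # approximately -0.67 + 1.11j
x2 = (-b - delta ** 0.5) / (2 * a)  # approximately -0.67 - 1.11j
print(delta, x1, x2)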