I'm having trouble with slow computation in my Python code. Based on the pycallgraph below, the bottleneck is the function miepython.miepython.mie_S1_S2 (highlighted in pink), which takes 0.47 seconds per call.
The source code for this module is as follows:
import numpy as np
from numba import njit, int32, float64, complex128
__all__ = ('ez_mie',
           'ez_intensities',
           'generate_mie_costheta',
           'i_par',
           'i_per',
           'i_unpolarized',
           'mie',
           'mie_S1_S2',
           'mie_cdf',
           'mie_mu_with_uniform_cdf',
           )
@njit((complex128, float64, float64[:]), cache=True)
def _mie_S1_S2(m, x, mu):
    """
    Calculate the scattering amplitude functions for spheres.

    The amplitude functions have been normalized so that when integrated
    over all 4*pi solid angles, the integral will be qext*pi*x**2.

    The units are weird, sr**(-0.5).

    Args:
        m: the complex index of refraction of the sphere
        x: the size parameter of the sphere
        mu: array of angles, cos(theta), to calculate scattering amplitudes

    Returns:
        S1, S2: the scattering amplitudes at each angle mu [sr**(-0.5)]
    """
    nstop = int(x + 4.05 * x**0.33333 + 2.0) + 1
    a = np.zeros(nstop - 1, dtype=np.complex128)
    b = np.zeros(nstop - 1, dtype=np.complex128)
    _mie_An_Bn(m, x, a, b)

    nangles = len(mu)
    S1 = np.zeros(nangles, dtype=np.complex128)
    S2 = np.zeros(nangles, dtype=np.complex128)

    nstop = len(a)
    for k in range(nangles):
        pi_nm2 = 0
        pi_nm1 = 1
        for n in range(1, nstop):
            tau_nm1 = n * mu[k] * pi_nm1 - (n + 1) * pi_nm2
            S1[k] += (2 * n + 1) * (pi_nm1 * a[n - 1]
                                    + tau_nm1 * b[n - 1]) / (n + 1) / n
            S2[k] += (2 * n + 1) * (tau_nm1 * a[n - 1]
                                    + pi_nm1 * b[n - 1]) / (n + 1) / n
            temp = pi_nm1
            pi_nm1 = ((2 * n + 1) * mu[k] * pi_nm1 - (n + 1) * pi_nm2) / n
            pi_nm2 = temp

    # calculate norm = sqrt(pi * Qext * x**2)
    n = np.arange(1, nstop + 1)
    norm = np.sqrt(2 * np.pi * np.sum((2 * n + 1) * (a.real + b.real)))

    S1 /= norm
    S2 /= norm
    return [S1, S2]
Apparently the source code is jitted by Numba, so it should be faster than it actually is. The total number of iterations of the double for loop in this function is around 25,000 (len(mu)=50, len(a)-1=500).
Any ideas on how to speed up this computation? Is something preventing Numba from running fast here? Or do you think the computation is already fast enough?
[More details]
In the above, another function, _mie_An_Bn, is used. This function is also jitted, and its source code is as follows:
@njit((complex128, float64, complex128[:], complex128[:]), cache=True)
def _mie_An_Bn(m, x, a, b):
    """
    Compute arrays of Mie coefficients A and B for a sphere.

    This estimates the size of the arrays based on Wiscombe's formula. The
    length of the arrays is chosen so that the error when the series are
    summed is around 1e-6.

    Args:
        m: the complex index of refraction of the sphere
        x: the size parameter of the sphere

    Returns:
        An, Bn: arrays of Mie coefficients
    """
    psi_nm1 = np.sin(x)              # nm1 = n - 1 = 0
    psi_n = psi_nm1 / x - np.cos(x)  # n = 1
    xi_nm1 = complex(psi_nm1, np.cos(x))
    xi_n = complex(psi_n, np.cos(x) / x + np.sin(x))

    nstop = len(a)
    if m.real > 0.0:
        D = _D_calc(m, x, nstop + 1)

        for n in range(1, nstop):
            temp = D[n] / m + n / x
            a[n - 1] = (temp * psi_n - psi_nm1) / (temp * xi_n - xi_nm1)
            temp = D[n] * m + n / x
            b[n - 1] = (temp * psi_n - psi_nm1) / (temp * xi_n - xi_nm1)
            xi = (2 * n + 1) * xi_n / x - xi_nm1
            xi_nm1 = xi_n
            xi_n = xi
            psi_nm1 = psi_n
            psi_n = xi_n.real
    else:
        for n in range(1, nstop):
            a[n - 1] = (n * psi_n / x - psi_nm1) / (n * xi_n / x - xi_nm1)
            b[n - 1] = psi_n / xi_n
            xi = (2 * n + 1) * xi_n / x - xi_nm1
            xi_nm1 = xi_n
            xi_n = xi
            psi_nm1 = psi_n
            psi_n = xi_n.real
Example inputs look like the following:
m = 1.336-2.462e-09j
x = 8526.95
mu = np.array([-1., -0.7500396, 0.46037385, 0.5988121, 0.67384093, 0.72468684,
               0.76421644, 0.79175856, 0.81723714, 0.83962897, 0.85924182,
               0.87641596, 0.89383665, 0.90708978, 0.91931481, 0.93067567,
               0.94073113, 0.94961222, 0.95689496, 0.96467123, 0.97138347,
               0.97791831, 0.98339434, 0.98870543, 0.99414948, 0.9975728,
               0.9989995, 0.9989995, 0.9989995, 0.9989995, 0.9989995,
               0.99899951, 0.99899951, 0.99899951, 0.99899951, 0.99899951,
               0.99899951, 0.99899951, 0.99899951, 0.99899951, 0.99899952,
               0.99899952, 0.99899952, 0.99899952, 0.99899952, 0.99899952,
               0.99899952, 0.99899952, 0.99899952, 1.])
I am focusing on _mie_S1_S2 since it appears to be the most expensive function for the provided example dataset.
First of all, you can pass the parameter fastmath=True to the JIT to accelerate the computation, provided no values like +Inf, -Inf, -0 or NaN are computed.
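For instance, assuming the same explicit signature as in the question, the decorator line of _mie_S1_S2 would then read:

@njit((complex128, float64, float64[:]), cache=True, fastmath=True)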
Then you can pre-compute some expensive expressions containing divisions or implicit integer-to-float conversions. Note that (2 * n + 1) / n = 2 + 1/n and (n + 1) / n = 1 + 1/n. This can be useful to reduce the number of precomputed arrays, but it did not change the performance on my machine (this may change depending on the target architecture). Note also that such a precomputation has a slight impact on the result accuracy (most of the time negligible and sometimes better than the reference implementation).
On my machine, this strategy makes the code 4.5 times faster with fastmath=True and 2.8 times faster without.
The k-based loop can be parallelized using parallel=True and Numba's prange. However, this may not always be faster on all machines (especially ones with many cores) since the loop is pretty fast.
Here is the final code:
import numba as nb  # needed below for nb.prange

# fastmath=True can be added to the decorator for a further speedup,
# provided no values like +Inf, -Inf, -0 or NaN are computed
@njit((complex128, float64, float64[:]), cache=True, parallel=True)
def _mie_S1_S2_opt(m, x, mu):
    nstop = int(x + 4.05 * x**0.33333 + 2.0) + 1
    a = np.zeros(nstop - 1, dtype=np.complex128)
    b = np.zeros(nstop - 1, dtype=np.complex128)
    _mie_An_Bn(m, x, a, b)

    nangles = len(mu)
    S1 = np.zeros(nangles, dtype=np.complex128)
    S2 = np.zeros(nangles, dtype=np.complex128)

    # precompute the expressions containing divisions
    factor1 = np.empty(nstop, dtype=np.float64)
    factor2 = np.empty(nstop, dtype=np.float64)
    factor3 = np.empty(nstop, dtype=np.float64)
    for n in range(1, nstop):
        factor1[n - 1] = (2 * n + 1) / (n + 1) / n
        factor2[n - 1] = (2 * n + 1) / n
        factor3[n - 1] = (n + 1) / n

    nstop = len(a)
    for k in nb.prange(nangles):
        pi_nm2 = 0
        pi_nm1 = 1
        for n in range(1, nstop):
            i = n - 1
            tau_nm1 = n * mu[k] * pi_nm1 - (n + 1.0) * pi_nm2
            S1[k] += factor1[i] * (pi_nm1 * a[i] + tau_nm1 * b[i])
            S2[k] += factor1[i] * (tau_nm1 * a[i] + pi_nm1 * b[i])
            temp = pi_nm1
            pi_nm1 = factor2[i] * mu[k] * pi_nm1 - factor3[i] * pi_nm2
            pi_nm2 = temp

    # calculate norm = sqrt(pi * Qext * x**2)
    n = np.arange(1, nstop + 1)
    norm = np.sqrt(2 * np.pi * np.sum((2 * n + 1) * (a.real + b.real)))

    S1 /= norm
    S2 /= norm
    return [S1, S2]
%timeit -n 1000 _mie_S1_S2_opt(m, x, mu)
On my machine with 6 cores, the final optimized implementation is 12 times faster with fastmath=True and 8.8 times faster without. Note that using similar strategies in the other functions may also help to speed them up.
I am trying to calculate g(x_(i+2)) from the values g(x_(i+1)) and g(x_i), where i is an integer, assuming I(x) and s(x) are Gaussian functions. If we know x_i = 100, then the summation runs from 0 to 100. I don't know how to handle g(x_i) with the subscript in Python: knowing the first and second values, we can find the third value, and after n cycles we can find the nth value.
Equation:
code:
import numpy as np
from matplotlib import pyplot as p
from math import pi

def f_s(x, mu_s, sig_s):
    ss = -np.power(x - mu_s, 2) / (2 * np.power(sig_s, 2))
    return np.exp(ss) / (np.power(2 * pi, 2) * sig_s)

def f_i(x, mu_i, sig_i):
    ii = -np.power(x - mu_i, 2) / (2 * np.power(sig_i, 2))
    return np.exp(ii) / (np.power(2 * pi, 2) * sig_i)

# problems occur in this part
def g(x, m, mu_s, sig_s, mu_i, sig_i):
    for i in range(1, m):  # specify the number x, x_1, x_2, x_3 ... x_m
        h = (x[i + 1] - x[i]) / e
        for n in range(0, x[i]):  # calculate summation
            sum_f = (f_i(x[i], mu_i, sig_i)
                     - f_s(x[i] - n, mu_s, sig_s) * g_x[n]) * np.conj(f_s(n + x[i], mu_s, sig_s))
        g_x[1] = 1  # initial value
        g_x[2] = 5
        g_x[i + 2] = h * sum_f + 2 * g_x[i + 1] - g_x[i]
    return g_x[i + 2]

x = np.linspace(-10, 10, 10000)
e = 1
d = 0.01
m = 1000
mu_s = 2
sig_s = 1
mu_i = 1
sig_i = 1

p.plot(x, g(x, m, mu_s, sig_s, mu_i, sig_i))
p.legend()
p.show()
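To clarify the part I am stuck on, I believe I need something like the following pattern, where the sequence values are stored in an array and filled index by index (a minimal sketch; the update rule below is only a placeholder, not my actual equation):

import numpy as np

m = 1000
g_x = np.zeros(m + 2)  # holds g(x_0), g(x_1), ..., g(x_{m+1})
g_x[0] = 1.0           # first known value
g_x[1] = 5.0           # second known value

for i in range(m):
    # placeholder three-term recurrence: each new value from the two before it
    g_x[i + 2] = 2.0 * g_x[i + 1] - g_x[i]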
result: [plot of I(x) and s(x) omitted]
import math
import numpy as np

S0 = 100.; K = 100.; T = 1.0; r = 0.05; sigma = 0.2
M = 100; dt = T / M; I = 500000

S = np.zeros((M + 1, I))
S[0] = S0
for t in range(1, M + 1):
    z = np.random.standard_normal(I)
    S[t] = S[t - 1] * np.exp((r - 0.5 * sigma ** 2) * dt
                             + sigma * math.sqrt(dt) * z)

C0 = math.exp(-r * T) * np.sum(np.maximum(S[-1] - K, 0)) / I
print("European Option Value is ", C0)
It gives a value of around 10.45 as you increase the number of simulations, but using the B-S formula the value should be around 10.09. Anybody know why the code isn't giving a number closer to the formula?
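For reference, here is a minimal sketch of the closed-form Black-Scholes value I am comparing against (using scipy.stats.norm for the normal CDF; the parameters mirror the simulation above):

import math
from scipy.stats import norm

S0, K, T, r, sigma = 100., 100., 1.0, 0.05, 0.2

d1 = (math.log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
d2 = d1 - sigma * math.sqrt(T)

# closed-form value of the European call
C = S0 * norm.cdf(d1) - K * math.exp(-r * T) * norm.cdf(d2)
print("Black-Scholes value is ", C)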
Pardon me, but I could not find a better title. Please look at this super-simple Python program:
x = start = 1.0
target = 0.1
coeff = 0.999

for c in range(100000):
    print('{:5d} {:f}'.format(c, x))
    if abs(target - x) < abs((x - start) * 0.01):
        break
    x = x * coeff + target * (1 - coeff)
Brief explanation: this program moves x towards target by iteratively computing the weighted average of x and target, with coeff as the weight. It stops when the remaining distance to target drops below 1% of the distance already covered from start.
The number of iterations remains the same no matter the initial value of x and target.
How can I set coeff in order to predict how many iterations will take place?
Thanks a lot.
Let's make this a function, f.
f(0) is the initial value (start, in this case 1.0).
f(x) = f(x - 1) * c + T * (1 - c).
(So f(1) is the next value of x, f(2) is the one after that, and so on. We want to find the value of x where |T - f(x)| < 0.01 * |f(0) - f(x)|)
So let's rewrite f(x) to be linear:
f(x) = f(x - 1) * c + T * (1 - c)
= (f(x - 2) * c + T * (1 - c)) * c + T * (1 - c)
= (f(x - 2) * c ** 2 + T * c * (1 - c)) + T * (1 - c)
= ((f(x - 3) * c + T * (1 - c)) * c ** 2 + T * c * (1 - c)) + T * (1 - c)
= f(x - 3) * c ** 3 + T * c ** 2 * (1 - c) + T * c * (1 - c) + T * (1 - c)
= f(0) * c ** x + T * c ** (x - 1) * (1 - c) + T * c ** (x - 2) * (1 - c) + ... + T * c * (1 - c) + T * (1 - c)
= f(0) * c ** x + T * (1 - c) * (c ** 0 + c ** 1 + ... + c ** (x - 1))
# Summation of a geometric series
= f(0) * c ** x + T * (1 - c) * ((1 - c ** x) / (1 - c))
= f(0) * c ** x + T * (1 - c ** x)
So, the nth value of x will be start * c ** n + target * (1 - c ** n).
We want:
|T - f(x)| < 0.01 * |f(0) - f(x)|
|T - f(0) * c ** x - T (1 - c ** x)| < 0.01 * |f(0) - f(0) * c ** x - T (1 - c ** x)|
|(c ** x) * T - (c ** x) f(0)| < 0.01 * |(1 - c ** x) * f(0) - (1 - c ** x) * T|
(c ** x) * |T - f(0)| < 0.01 * (1 - c ** x) * |T - f(0)|
c ** x < 0.01 * (1 - c ** x)
c ** x < 0.01 - 0.01 * c ** x
1.01 * c ** x < 0.01
c ** x < 1 / 101
x > log (1 / 101) / log c
(Dividing by log c reverses the inequality, because log c is negative for c < 1. With c = 0.999 this gives x > 4612.8, and indeed the loop terminates on step 4613.)
In the end, it is independent of start and target.
Also, for a general relative difference of p,
c ** x < p * (1 - c ** x)
c ** x < p - p * c ** x
(1 + p) * c ** x < p
c ** x < p / (1 + p)
x > log (p / (1 + p)) / log c
So for the 1% criterion (p = 0.01, giving p / (1 + p) = 1 / 101), a coefficient of c will take about log (1 / 101) / log c steps.
If you have the number of steps you want, call it I, you have
I = log_c(1 / 101)
c ** I = 1 / 101
c = (1 / 101) ** (1 / I)
So c should be set to the Ith root of 1 / 101.
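A quick sanity check of both formulas against the original loop (a minimal sketch):

from math import ceil, log

start, target, coeff = 1.0, 0.1, 0.999

# predicted step count: smallest integer x with c ** x < 1 / 101
predicted = ceil(log(1 / 101) / log(coeff))

# run the loop from the question to confirm
x, steps = start, 0
while not abs(target - x) < abs((x - start) * 0.01):
    x = x * coeff + target * (1 - coeff)
    steps += 1

print(predicted, steps)  # both print 4613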
Your code reduces the distance between x and the target by a factor of coeff in each execution of the loop. Thus, if start is greater than target, we get the formula
x - target = (start - target) * coeff ** c
where c is the number of loops we have done.
Your ending criterion is (again, if start is greater than target),
x - target < (start - x) * 0.01
Solving for x by algebra we get
x < (target + 0.01 * start) / (1 + 0.01)
Substituting that into our first expression and simplifying a bit makes both start and target drop out of the inequality--now you see why those values did not matter--and we get
coeff ** c < 0.01 / (1 + 0.01)
Solving for c (remembering that a logarithm to a base less than 1 reverses the inequality) we get
c > log(0.01 / (1 + 0.01), coeff)
So the final answer for the number of loops is
ceil(log(0.01 / (1 + 0.01), coeff))
or alternatively, if you do not like logarithms to an arbitrary base,
ceil(log(0.01 / (1 + 0.01)) / log(coeff))
You could replace the first logarithm in that last expression with its result, but I left it that way so you can see what result you would get if you change the constant in your end criterion away from 0.01.
The result of that expression in your particular case is
4613
which is correct. Note that both ceil and log are in Python's math module, so remember to import them before doing the calculation. Also note that Python's floating-point calculations are not exact, so your actual number of loops may differ from this by one if you change the values of coeff or of 0.01.
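In code, that final expression is just (a minimal sketch):

from math import ceil, log

coeff = 0.999
p = 0.01  # the constant from the end criterion
print(ceil(log(p / (1 + p), coeff)))  # 4613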
def calculate(i, j, m, k, n):
    for v in range(1, n + 1):
        ans = (i * k + j) % m
        k = ans
    return ans
The program represents the general formula x = (i * k + j) % m, where k is the previous answer. In other words, x1 = (i * x0 + j) % m, x2 = (i * x1 + j) % m, and so forth. The problem I'm having is that it takes a long while to calculate for large inputs.
With that in mind, I was thinking along the lines of using a series formula, such as the arithmetic series a + (n - 1) * d, but I'm unsure how to implement it in a program such as this.
x1 = (i * x0 + j)
x2 = (i * x1 + j) = i * i * x0 + i * j + j
x3 = i * i * i * x0 + i * i * j + i * j + j
xn = i^n * x0 + sum(i^t for t from 0 to n - 1) * j
= i^n * x0 + (i^n - 1) / (i - 1) * j
Found the last line with Wolfram Alpha.
The formula is nice if m is a prime. In that case you can perform all the computations modulo that prime, including the division, to keep the numbers small. You'd just need to make the exponentiation i^n fast.
I suggest you look at https://en.wikipedia.org/wiki/Modular_exponentiation#Right-to-left_binary_method and references therein.
This should give you O(log(n)) time complexity, compared to the O(n) of your loop.
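Here is a minimal sketch of that approach for prime m, using Python's built-in three-argument pow for fast modular exponentiation and Fermat's little theorem for the division (the name calculate_fast_prime is just for illustration; x0 is the starting k from the question):

def calculate_fast_prime(i, j, m, x0, n):
    """Closed form x_n = i^n * x0 + (i^n - 1) / (i - 1) * j (mod m), m prime."""
    if i % m == 1:
        # degenerate case: the geometric sum collapses to n
        return (x0 + n * j) % m
    i_pow = pow(i, n, m)          # i^n mod m in O(log n)
    inv = pow(i - 1, m - 2, m)    # modular inverse of (i - 1) by Fermat
    return (i_pow * x0 + (i_pow - 1) * inv * j) % m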
If m is not a prime, the division in the above formula is annoying. But you can do something similar to exponentiation by squaring to compute the sum, too. Observe
1 + i + i^2 + i^3 + ... + i^(2n+1) = (1 + i) * (1 + i^2 + i^4 + ... + i^(2n))
1 + i + i^2 + i^3 + ... + i^(2n+2) = 1 + (i + i^2) * (1 + i^2 + i^4 + ... + i^(2n))
so you can halve the number of summands in the right parenthesis at each step. Now there is no division, so you can perform modulo operations after each operation.
You can thus define something like
def modpowsum(a, n, m):
    """(1 + a + a^2 + a^3 + ... + a^n) mod m"""
    if n == 0:
        return 1
    if n == 1:
        return (1 + a) % m
    if n % 2 == 1:
        return ((1 + a) * modpowsum((a * a) % m, (n - 1) // 2, m)) % m
    return (1 + ((a + a * a) % m) * modpowsum((a * a) % m, n // 2 - 1, m)) % m
The whole computation can be seen at https://ideone.com/Xh0Fuf running some random and some not-so-random test cases against your implementation.
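For completeness, the closed form from above can then be evaluated without any division, for n >= 1 (calculate_fast is only an illustrative name):

def calculate_fast(i, j, m, k, n):
    # x_n = i^n * x0 + (1 + i + ... + i^(n - 1)) * j (mod m)
    return (pow(i, n, m) * k + modpowsum(i % m, n - 1, m) * j) % m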