Having problems making this program run faster for large inputs

def calculate(i, j, m, k, n):
    for v in range(1, n + 1):
        ans = (i * k + j) % m
        k = ans
    return ans
The program implements the recurrence x = (i * k + j) % m, where k is the previous result. In other words, x1 = (i * x0 + j) % m, x2 = (i * x1 + j) % m, and so forth. The problem is that it takes a long time for large n.
With that in mind, I was thinking of using a closed-form series formula, such as the arithmetic series term a + (n - 1) * d, but I'm unsure how to implement it in a program like this.

x1 = (i * x0 + j)
x2 = (i * x1 + j) = i * i * x0 + i * j + j
x3 = i * i * i * x0 + i * i * j + i * j + j
xn = i^n * x0 + sum(i^t for t from 0 to n - 1) * j
= i^n * x0 + (i^n - 1) / (i - 1) * j
Found the last line with Wolfram Alpha.
The formula is convenient if m is prime and i - 1 is not a multiple of m. In that case you can perform all the computations modulo that prime, including the division (as multiplication by a modular inverse), to keep the numbers small.
You just need a fast way to compute i^n.
I suggest you look at https://en.wikipedia.org/wiki/Modular_exponentiation#Right-to-left_binary_method and the references therein.
This gives you O(log(n)) time complexity, compared to the O(n) of your loop.
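A minimal sketch of the prime-modulus case, assuming m is prime. The name calculate_prime is made up for illustration. It uses Python's built-in three-argument pow for the fast exponentiation and Fermat's little theorem for the inverse of i - 1, and it treats the case i ≡ 1 (mod m) separately, because the closed form would divide by zero there.

```python
def calculate_prime(i, j, m, k, n):
    """x_n for x_{t+1} = (i * x_t + j) % m with x_0 = k.

    Assumption: m is prime (hypothetical helper, not from the question).
    """
    power = pow(i, n, m)  # i^n mod m in O(log n) multiplications
    if (i - 1) % m == 0:
        # i is congruent to 1 mod m, so 1 + i + ... + i^(n-1) is just n
        geometric = n % m
    else:
        # Fermat's little theorem: (i - 1)^(m - 2) is the inverse of (i - 1) mod m
        inverse = pow(i - 1, m - 2, m)
        geometric = (power - 1) * inverse % m
    return (power * k + geometric * j) % m


print(calculate_prime(3, 4, 7, 2, 5))  # same value as calculate(3, 4, 7, 2, 5)
```

The result only agrees with the loop when m really is prime; for composite m the inverse may not exist.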
If m is not prime, the division in the above formula is annoying. But you can do something similar to exponentiation by squaring to compute the sum, too. Observe
1 + i + i^2 + i^3 + … + i^(2n+1) =
(1 + i) * (1 + i^2 + i^4 + i^6 + … + i^(2n))
1 + i + i^2 + i^3 + … + i^(2n+2) =
1 + (i + i^2) * (1 + i^2 + i^4 + i^6 + … + i^(2n))
so you can halve the number of summands in the right parenthesis at each step. There is no division any more, so you can reduce modulo m after each operation.
You can thus define something like
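def modpowsum(a, n, m):
    """(1 + a + a^2 + a^3 + ... + a^n) mod m"""
    if n == 0:
        return 1 % m  # reduce so that m == 1 also gives 0
    if n == 1:
        return (1 + a) % m
    if n % 2 == 1:
        # odd n: (1 + a) * (1 + a^2 + ... + a^(n-1))
        return ((1 + a) * modpowsum((a * a) % m, (n - 1) // 2, m)) % m
    # even n: 1 + (a + a^2) * (1 + a^2 + ... + a^(n-2))
    return (1 + ((a + a * a) % m) * modpowsum((a * a) % m, n // 2 - 1, m)) % m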
The whole computation can be seen at https://ideone.com/Xh0Fuf running some random and some not-so-random test cases against your implementation.
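For reference, here is one way the pieces could fit together for an arbitrary modulus. This is a sketch, not necessarily what the linked code does. modpowsum is repeated so the block runs on its own, and calculate_fast is a hypothetical name for the O(log n) replacement of the loop.

```python
def modpowsum(a, n, m):
    """(1 + a + a^2 + ... + a^n) mod m, without any division."""
    if n == 0:
        return 1 % m
    if n == 1:
        return (1 + a) % m
    if n % 2 == 1:
        return ((1 + a) * modpowsum((a * a) % m, (n - 1) // 2, m)) % m
    return (1 + ((a + a * a) % m) * modpowsum((a * a) % m, n // 2 - 1, m)) % m


def calculate_fast(i, j, m, k, n):
    """x_n = i^n * x_0 + (1 + i + ... + i^(n-1)) * j, all mod m, with x_0 = k."""
    if n == 0:
        # Assumption: zero steps means the start value itself (the loop version
        # has no defined result for n == 0).
        return k % m
    return (pow(i, n, m) * k + modpowsum(i, n - 1, m) * j) % m


print(calculate_fast(3, 4, 12, 2, 10**18))  # returns immediately even for huge n
```

The recursion depth grows with log2(n), so even n around 10^18 needs only about 60 nested calls.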

Related

How to speed up the computation that is slow even with Numba

I'm having trouble with the slow computation of my Python code. Based on the pycallgraph below, the bottleneck seems to be the module named miepython.miepython.mie_S1_S2 (highlighted by pink), which takes 0.47 seconds per call.
The source code for this module is as follows:
import numpy as np
from numba import njit, int32, float64, complex128
__all__ = ('ez_mie',
           'ez_intensities',
           'generate_mie_costheta',
           'i_par',
           'i_per',
           'i_unpolarized',
           'mie',
           'mie_S1_S2',
           'mie_cdf',
           'mie_mu_with_uniform_cdf',
           )
@njit((complex128, float64, float64[:]), cache=True)
def _mie_S1_S2(m, x, mu):
    """
    Calculate the scattering amplitude functions for spheres.
    The amplitude functions have been normalized so that when integrated
    over all 4*pi solid angles, the integral will be qext*pi*x**2.
    The units are weird, sr**(-0.5)
    Args:
        m: the complex index of refraction of the sphere
        x: the size parameter of the sphere
        mu: array of angles, cos(theta), to calculate scattering amplitudes
    Returns:
        S1, S2: the scattering amplitudes at each angle mu [sr**(-0.5)]
    """
    nstop = int(x + 4.05 * x**0.33333 + 2.0) + 1
    a = np.zeros(nstop - 1, dtype=np.complex128)
    b = np.zeros(nstop - 1, dtype=np.complex128)
    _mie_An_Bn(m, x, a, b)
    nangles = len(mu)
    S1 = np.zeros(nangles, dtype=np.complex128)
    S2 = np.zeros(nangles, dtype=np.complex128)
    nstop = len(a)
    for k in range(nangles):
        pi_nm2 = 0
        pi_nm1 = 1
        for n in range(1, nstop):
            tau_nm1 = n * mu[k] * pi_nm1 - (n + 1) * pi_nm2
            S1[k] += (2 * n + 1) * (pi_nm1 * a[n - 1]
                                    + tau_nm1 * b[n - 1]) / (n + 1) / n
            S2[k] += (2 * n + 1) * (tau_nm1 * a[n - 1]
                                    + pi_nm1 * b[n - 1]) / (n + 1) / n
            temp = pi_nm1
            pi_nm1 = ((2 * n + 1) * mu[k] * pi_nm1 - (n + 1) * pi_nm2) / n
            pi_nm2 = temp
    # calculate norm = sqrt(pi * Qext * x**2)
    n = np.arange(1, nstop + 1)
    norm = np.sqrt(2 * np.pi * np.sum((2 * n + 1) * (a.real + b.real)))
    S1 /= norm
    S2 /= norm
    return [S1, S2]
The source code is jitted by Numba, so I expected it to be faster than it is. The nested for loops in this function run about 25,000 iterations in total (len(mu)=50, len(a)-1=500).
Any ideas on how to speed up this computation? Is something preventing Numba from running it fast? Or do you think the computation is already fast enough?
[More details]
In the above, another function _mie_An_Bn is being used. This function is also jitted, and the source code is as follows:
@njit((complex128, float64, complex128[:], complex128[:]), cache=True)
def _mie_An_Bn(m, x, a, b):
    """
    Compute arrays of Mie coefficients A and B for a sphere.
    This estimates the size of the arrays based on Wiscombe's formula. The length
    of the arrays is chosen so that the error when the series are summed is
    around 1e-6.
    Args:
        m: the complex index of refraction of the sphere
        x: the size parameter of the sphere
    Returns:
        An, Bn: arrays of Mie coefficients
    """
    psi_nm1 = np.sin(x)  # nm1 = n-1 = 0
    psi_n = psi_nm1 / x - np.cos(x)  # n = 1
    xi_nm1 = complex(psi_nm1, np.cos(x))
    xi_n = complex(psi_n, np.cos(x) / x + np.sin(x))
    nstop = len(a)
    if m.real > 0.0:
        D = _D_calc(m, x, nstop + 1)
        for n in range(1, nstop):
            temp = D[n] / m + n / x
            a[n - 1] = (temp * psi_n - psi_nm1) / (temp * xi_n - xi_nm1)
            temp = D[n] * m + n / x
            b[n - 1] = (temp * psi_n - psi_nm1) / (temp * xi_n - xi_nm1)
            xi = (2 * n + 1) * xi_n / x - xi_nm1
            xi_nm1 = xi_n
            xi_n = xi
            psi_nm1 = psi_n
            psi_n = xi_n.real
    else:
        for n in range(1, nstop):
            a[n - 1] = (n * psi_n / x - psi_nm1) / (n * xi_n / x - xi_nm1)
            b[n - 1] = psi_n / xi_n
            xi = (2 * n + 1) * xi_n / x - xi_nm1
            xi_nm1 = xi_n
            xi_n = xi
            psi_nm1 = psi_n
            psi_n = xi_n.real
Example inputs look like this:
m = 1.336-2.462e-09j
x = 8526.95
mu = np.array([-1., -0.7500396, 0.46037385, 0.5988121, 0.67384093, 0.72468684, 0.76421644, 0.79175856, 0.81723714, 0.83962897, 0.85924182, 0.87641596, 0.89383665, 0.90708978, 0.91931481, 0.93067567, 0.94073113, 0.94961222, 0.95689496, 0.96467123, 0.97138347, 0.97791831, 0.98339434, 0.98870543, 0.99414948, 0.9975728, 0.9989995, 0.9989995, 0.9989995, 0.9989995, 0.9989995, 0.99899951, 0.99899951, 0.99899951, 0.99899951, 0.99899951, 0.99899951, 0.99899951, 0.99899951, 0.99899951, 0.99899952, 0.99899952,
               0.99899952, 0.99899952, 0.99899952, 0.99899952, 0.99899952, 0.99899952, 0.99899952, 1.])

I am focusing on _mie_S1_S2 since it appears to be the most expensive function on the provided example dataset.
First of all, you can pass fastmath=True to the JIT decorator to accelerate the computation, provided no values such as +Inf, -Inf, -0 or NaN are computed.
Then you can precompute some expensive expressions containing divisions or implicit integer-to-float conversions. Note that (2 * n + 1) / n = 2 + 1/n and (n + 1) / n = 1 + 1/n. This can be used to reduce the number of precomputed arrays, but it did not change the performance on my machine (this may depend on the target architecture). Note also that such precomputation has a slight impact on the accuracy of the result (most of the time negligible, and sometimes better than the reference implementation).
On my machine, this strategy makes the code 4.5 times faster with fastmath=True and 2.8 times faster without.
The k-based loop can be parallelized using parallel=True and Numba's prange. However, this may not be faster on all machines (especially ones with a lot of cores) since the loop is already pretty fast.
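To illustrate the two identities, here is a small pure-Python sketch (no Numba, so it says nothing about speed). It runs the pi_n recurrence once with the division inside the loop and once with a single precomputed 1/n table. The function names are made up for this illustration.

```python
def pi_direct(mu, nstop):
    """pi_1 .. pi_(nstop-1) with the division kept inside the loop."""
    pi_nm2, pi_nm1 = 0.0, 1.0
    values = []
    for n in range(1, nstop):
        values.append(pi_nm1)
        pi_nm2, pi_nm1 = pi_nm1, ((2 * n + 1) * mu * pi_nm1 - (n + 1) * pi_nm2) / n
    return values


def pi_precomputed(mu, nstop):
    """Same recurrence using (2n+1)/n = 2 + 1/n and (n+1)/n = 1 + 1/n."""
    inv_n = [1.0 / n for n in range(1, nstop)]  # one table instead of two
    pi_nm2, pi_nm1 = 0.0, 1.0
    values = []
    for n in range(1, nstop):
        r = inv_n[n - 1]
        values.append(pi_nm1)
        pi_nm2, pi_nm1 = pi_nm1, (2.0 + r) * mu * pi_nm1 - (1.0 + r) * pi_nm2
    return values


direct = pi_direct(0.5, 20)
table = pi_precomputed(0.5, 20)
print(max(abs(u - v) for u, v in zip(direct, table)))  # tiny rounding difference
```

The two versions differ only by floating-point rounding, which is the slight accuracy impact mentioned above.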
Here is the final code:
import numba as nb  # needed for nb.prange below
@njit((complex128, float64, float64[:]), cache=True, parallel=True)
def _mie_S1_S2_opt(m, x, mu):
    nstop = int(x + 4.05 * x**0.33333 + 2.0) + 1
    a = np.zeros(nstop - 1, dtype=np.complex128)
    b = np.zeros(nstop - 1, dtype=np.complex128)
    _mie_An_Bn(m, x, a, b)
    nangles = len(mu)
    S1 = np.zeros(nangles, dtype=np.complex128)
    S2 = np.zeros(nangles, dtype=np.complex128)
    # precompute the n-dependent factors once instead of once per angle
    factor1 = np.empty(nstop, dtype=np.float64)
    factor2 = np.empty(nstop, dtype=np.float64)
    factor3 = np.empty(nstop, dtype=np.float64)
    for n in range(1, nstop):
        factor1[n - 1] = (2 * n + 1) / (n + 1) / n
        factor2[n - 1] = (2 * n + 1) / n
        factor3[n - 1] = (n + 1) / n
    nstop = len(a)
    # each angle k is independent, so the outer loop can run in parallel
    for k in nb.prange(nangles):
        pi_nm2 = 0
        pi_nm1 = 1
        for n in range(1, nstop):
            i = n - 1
            tau_nm1 = n * mu[k] * pi_nm1 - (n + 1.0) * pi_nm2
            S1[k] += factor1[i] * (pi_nm1 * a[i] + tau_nm1 * b[i])
            S2[k] += factor1[i] * (tau_nm1 * a[i] + pi_nm1 * b[i])
            temp = pi_nm1
            pi_nm1 = factor2[i] * mu[k] * pi_nm1 - factor3[i] * pi_nm2
            pi_nm2 = temp
    # calculate norm = sqrt(pi * Qext * x**2)
    n = np.arange(1, nstop + 1)
    norm = np.sqrt(2 * np.pi * np.sum((2 * n + 1) * (a.real + b.real)))
    S1 /= norm
    S2 /= norm
    return [S1, S2]
%timeit -n 1000 _mie_S1_S2_opt(m, x, mu)
On my machine with 6 cores, the final optimized implementation is 12 times faster with fastmath=True and 8.8 times faster without. Note that similar strategies may also help speed up the other functions.

np.int64 behaves differently from int in math-operations

I have come across a very strange problem where I do a lot of math and the result is inf or nan when my input is of type <class 'numpy.int64'>, but I get the correct (checked analytically) results when my input is of type <class 'int'>. The only library functions I use are np.math.factorial(), np.sum() and np.array(). I also use a generator object to sum over series and the Boltzmann constant from scipy.constants.
My question is essentially this: Are there any known cases where np.int64 objects behave very differently from int objects?
When I run with np.int64 input, I get the RuntimeWarnings: overflow encountered in long_scalars, divide by zero encountered in double_scalars and invalid value encountered in double_scalars. However, the largest number I plug into the factorial function is 36, and I don't get these warnings when I use int input.
Below is code that reproduces the behaviour. I was unable to narrow down exactly where it comes from.
import numpy as np
import scipy.constants as const
# Some representable numbers
sigma = np.array([1, 2])
sigma12 = 1.5
mole_weights = np.array([10, 15])
T = 100
M1, M2 = mole_weights / np.sum(mole_weights)
m0 = np.sum(mole_weights)
fac = np.math.factorial
def summation(start, stop, func, args=None):
    # sum over the function func for all ints from start to and including stop, pass 'args' as additional arguments
    if args is not None:
        return sum(func(i, args) for i in range(start, stop + 1))
    else:
        return sum(func(i) for i in range(start, stop + 1))
def delta(i, j):
    # Kronecker delta
    if i == j:
        return 1
    else:
        return 0
def w(l, r):
    # l, r are ints, returns a float
    return 0.25 * (2 - ((1 / (l + 1)) * (1 + (-1) ** l))) * np.math.factorial(r + 1)
def omega(ij, l, r):
    # l, r are ints, ij is an ID, returns a float
    if ij in (1, 2):
        return sigma[ij - 1] ** 2 * np.sqrt(
            (np.pi * const.Boltzmann * T) / mole_weights[ij - 1]) * w(l, r)
    elif ij in (12, 21):
        return 0.5 * sigma12 ** 2 * np.sqrt(
            2 * np.pi * const.Boltzmann * T / (m0 * M1 * M2)) * w(l, r)
    else:
        raise ValueError('(' + str(ij) + ', ' + str(l) + ', ' + str(r) + ') are non-valid arguments for omega.')
def A_prime(p, q, r, l):
    '''
    p, q, r, l are ints. Returns a float
    '''
    F = (M1 ** 2 + M2 ** 2) / (2 * M1 * M2)
    G = (M1 - M2) / M2
    def inner(w, args):
        i, k = args
        return ((8 ** i * fac(p + q - 2 * i - w) * (-1) ** (r + i) * fac(r + 1) * fac(
            2 * (p + q + 2 - i - w)) * 2 ** (2 * r) * F ** (i - k) * G ** w) /
                (fac(p - i - w) * fac(q - i - w) * fac(r - i) * fac(p + q + 1 - i - r - w) * fac(2 * r + 2) * fac(
                    p + q + 2 - i - w)
                 * 4 ** (p + q + 1) * fac(k) * fac(i - k) * fac(w))) * (
                       2 ** (2 * w - 1) * M1 ** i * M2 ** (p + q - i - w)) * 2 * (
                       M1 * (p + q + 1 - i - r - w) * delta(k, l) - M2 * (r - i) * delta(k, l - 1))
    def sum_w(k, i):
        return summation(0, min(p, q, p + q + 1 - r) - i, inner, args=(i, k))
    def sum_k(i):
        return summation(l - 1, min(l, i), sum_w, args=i)
    return summation(l - 1, min(p, q, r, p + q + 1 - r), sum_k)
def H_i(p, q):
    '''
    p, q are ints. Returns a float
    '''
    def inner(r, l):
        return A_prime(p, q, r, l) * omega(12, l, r)
    def sum_r(l):
        return summation(l, p + q + 2 - l, inner, args=l)
    val = 8 * summation(1, min(p, q) + 1, sum_r)
    return val
p, q = np.int64(8), np.int64(8)
print(H_i(p, q))  # nan
print(H_i(int(p), int(q)))  # 1.3480582058153066e-08

NumPy's int64 is a 64-bit integer, meaning it consists of 64 binary digits that are each either 0 or 1. Thus the smallest representable value is -2**63 and the biggest one is 2**63 - 1.
Python's int is essentially unlimited in length, so it can represent any integer that fits in memory. It is comparable to a BigInteger in Java. Internally it is stored as a sequence of fixed-size digits that together are treated as a single large number.
What you have here is a classic integer overflow. You mentioned that you "only" plug 36 into the factorial function, but the factorial function grows very fast, and 36! = 3.7e41 > 9.2e18 = 2**63 - 1, so you get a number bigger than you can represent in an int64!
Since int64s are also called longs, this is exactly what the warning overflow encountered in long_scalars is trying to tell you!
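The size comparison can be checked with plain Python integers, which never overflow. This is a small sketch using only the standard library; the helper name is made up for illustration.

```python
import math

INT64_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold


def largest_factorial_arg_fitting_int64():
    """Largest n such that n! still fits into a signed 64-bit integer."""
    n = 0
    while math.factorial(n + 1) <= INT64_MAX:
        n += 1
    return n


print(math.factorial(36) > INT64_MAX)          # True: 36! needs far more than 64 bits
print(largest_factorial_arg_fitting_int64())   # 20
```

So any intermediate product that reaches the size of 21! or more can no longer be held in an int64.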

How can I resolve this binomial equation by coding?

I'm having trouble figuring out how to write a function that solves this problem:
Write a function that will take as an input two numbers (l,m) and return as a tuple the coefficients (a,b,c) for the quadratic equation a x^2 + b x + c found from expanding (x + l) * (x + m).
def func(l,m):
    a = 1
    equation = (a * (x ** 2)) + (b * x) + c
    coef = [a,b,c]
    eq2 = (x + m) * (x + l)
    coef1 = m + l
    coef2 = m * l
    if coef1 == coef[1] and coef2 == coef[2]:
        return coef
func(2,2)

Just to make it clear:
Your problem states:
return as a tuple the coefficients (a,b,c) for the quadratic equation
a x^2 + b x + c found from expanding (x + l) * (x + m).
Let's find the equation by expanding:
(x + l) * (x + m) =
= x^2 + l*x + m*x + l*m =
= x^2 + (l+m)*x + l*m
Now, by coefficients comparison with a x^2 + b x + c, we get that:
a = 1
b = l + m
c = l * m
So your function can basically return (1, l + m, l * m) directly...
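Written out as code, the result of the expansion could look like this (the function name is chosen for illustration):

```python
def quadratic_coefficients(l, m):
    """Coefficients (a, b, c) of a*x^2 + b*x + c = (x + l) * (x + m)."""
    return (1, l + m, l * m)


print(quadratic_coefficients(2, 2))  # (1, 4, 4), since (x + 2)^2 = x^2 + 4x + 4
```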

Now that we have your code, I can tell you that you're not using Python variables the right way. You can't use an unknown variable in Python the way you can in math (here you called it x); every name must be assigned a value before it is used.
There are modules that allow such operations with a different syntax, such as SymPy.
If you don't want to use one and you want to solve it "by hand", maybe for a school project, you'll need to compute a, b and c directly from l and m with formulas.
As Tomerikoo mentioned:
a = 1
b = l + m
c = l * m

How to create a Single Vector having 2 Dimensions?

I have used the equation of motion (Newton's second law) for a simple spring and mass scenario, incorporating it into the given second-order ODE y" + (k/m)x = 0; y(0) = 3; y'(0) = 0.
I have then been able to run a code that calculates and compares the Exact Solution with the Runge-Kutta Method Solution.
It works fine...however, I have recently been asked not to separate my values of 'x' and 'v', but use a single vector 'x' that has two dimensions ( i.e. 'x' and 'v' can be handled by x(1) and x(2) ).
MY CODE:
import numpy as np
# Given is y" + (k/m)x = 0; y(0) = 3; y'(0) = 0
# Parameters
h = 0.01  # Step Size
t = 100.0  # Time(sec)
k = 1
m = 1
x0 = 3
v0 = 0
# Exact Analytical Solution
te = np.arange(0, t, h)
N = len(te)
w = (k / m) ** 0.5
x_exact = x0 * np.cos(w * te)
v_exact = -x0 * w * np.sin(w * te)
# Runge-Kutta Method
x = np.empty(N)
v = np.empty(N)
x[0] = x0
v[0] = v0
def f1(t, x, v):
    # dx/dt = v
    x = v
    return x
def f2(t, x, v):
    # dv/dt = -(k/m) * x
    v = -(k / m) * x
    return v
for i in range(N - 1):  # MAIN LOOP
    K1x = f1(te[i], x[i], v[i])
    K1v = f2(te[i], x[i], v[i])
    K2x = f1(te[i] + h / 2, x[i] + h * K1x / 2, v[i] + h * K1v / 2)
    K2v = f2(te[i] + h / 2, x[i] + h * K1x / 2, v[i] + h * K1v / 2)
    K3x = f1(te[i] + h / 2, x[i] + h * K2x / 2, v[i] + h * K2v / 2)
    K3v = f2(te[i] + h / 2, x[i] + h * K2x / 2, v[i] + h * K2v / 2)
    K4x = f1(te[i] + h, x[i] + h * K3x, v[i] + h * K3v)
    K4v = f2(te[i] + h, x[i] + h * K3x, v[i] + h * K3v)
    x[i + 1] = x[i] + h / 6 * (K1x + 2 * K2x + 2 * K3x + K4x)
    v[i + 1] = v[i] + h / 6 * (K1v + 2 * K2v + 2 * K3v + K4v)
Can anyone help me understand how I can create this single vector having 2 dimensions, and how to fix my code up please?

You can use the np.array() function. Here is an example of a two-dimensional array:
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

I'm unsure exactly what you are expecting beyond having two lists inside a single list, but I hope this link helps with your issue.
https://www.tutorialspoint.com/python_data_structure/python_2darray.htm?
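One possible reading of the request is to keep position and velocity together in one state vector [x, v] and let a single derivative function return [dx/dt, dv/dt]. Here is a minimal sketch under that assumption, written with plain lists so it runs without NumPy. The names deriv, axpy, rk4_step and integrate are made up for this illustration.

```python
import math

K_SPRING = 1.0  # spring constant k from the question
MASS = 1.0      # mass m from the question


def deriv(t, state):
    """Right-hand side of the first-order system; state = [x, v]."""
    x, v = state
    return [v, -(K_SPRING / MASS) * x]


def axpy(state, scale, direction):
    """Return state + scale * direction, element by element."""
    return [s + scale * d for s, d in zip(state, direction)]


def rk4_step(f, t, state, h):
    """One classical Runge-Kutta step applied to the whole state vector."""
    k1 = f(t, state)
    k2 = f(t + h / 2, axpy(state, h / 2, k1))
    k3 = f(t + h / 2, axpy(state, h / 2, k2))
    k4 = f(t + h, axpy(state, h, k3))
    return [s + h / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]


def integrate(state0, h, steps):
    """Advance the state by the given number of steps of size h."""
    state, t = list(state0), 0.0
    for _ in range(steps):
        state = rk4_step(deriv, t, state, h)
        t += h
    return state


x_end, v_end = integrate([3.0, 0.0], 0.01, 100)  # state at t = 1
print(x_end, 3.0 * math.cos(1.0))                # numerical vs exact position
```

With NumPy the same structure works if the state is np.array([x, v]) and deriv returns an array; the axpy helper then becomes ordinary array arithmetic.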

Why Won't This Python Code match the Formula for a European Call Option?

import math
import numpy as np
S0 = 100.; K = 100.; T = 1.0; r = 0.05; sigma = 0.2
M = 100; dt = T / M; I = 500000
S = np.zeros((M + 1, I))
S[0] = S0
for t in range(1, M + 1):
    z = np.random.standard_normal(I)
    S[t] = S[t - 1] * np.exp((r - 0.5 * sigma ** 2) * dt + sigma *
                             math.sqrt(dt) * z)
C0 = math.exp(-r * T) * np.sum(np.maximum(S[-1] - K, 0)) / I
print("European Option Value is ", C0)
It gives a value of around 10.45 as you increase the number of simulations, but using the B-S formula the value should be around 10.09. Anybody know why the code isn't giving a number closer to the formula?
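As a cross-check, the closed-form value can be computed directly. This is a sketch that assumes the standard Black-Scholes formula for a European call on a non-dividend-paying asset and uses only the standard library; the function names are chosen for illustration. For the parameters above it evaluates to roughly 10.45, which is the value the simulation converges to.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function via math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(S0, K, T, r, sigma):
    """Black-Scholes price of a European call (no dividends assumed)."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S0 * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)


print(black_scholes_call(100.0, 100.0, 1.0, 0.05, 0.2))  # roughly 10.45
```

If a different figure such as 10.09 comes out of a formula implementation, it is worth checking that implementation's inputs (for example the rate or volatility) against the ones used in the simulation.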
