Python: Comparing the speed of NumPy and SymPy ufuncified functions - python

I just wrote some code to compare the calculation speed of a function written in plain NumPy against one built with ufuncify from SymPy:
import numpy as np
from sympy import symbols, Matrix
from sympy.utilities.autowrap import ufuncify
u,v,e,a1,a0 = symbols('u v e a1 a0')
dudt = u-u**3-v
dvdt = e*(u-a1*v-a0)
p = {'a1':0.5,'a0':1.5,'e':0.1}
eqs = Matrix([dudt,dvdt])
numeqs=eqs.subs([(a1,p['a1']),(a0,p['a0']),(e,p['e'])])
print(eqs)
print(numeqs)
# compile each right-hand side into a NumPy ufunc
dudt = ufuncify([u,v],numeqs[0])
dvdt = ufuncify([u,v],numeqs[1])
def syrhs(u,v):
    return dudt(u,v),dvdt(u,v)
def nprhs(u,v,p):
    dudt = u-u**3-v
    dvdt = p['e']*(u-p['a1']*v-p['a0'])
    return dudt,dvdt
def compare(n=10000):
    import time
    timer_np=0
    timer_sy=0
    error = np.zeros(n)
    for i in range(n):
        u=np.random.random((128,128))
        v=np.random.random((128,128))
        start_time=time.time()
        npcalc=np.ravel(nprhs(u,v,p))
        mid_time=time.time()
        sycalc=np.ravel(syrhs(u,v))
        end_time=time.time()
        timer_np+=(mid_time-start_time)
        timer_sy+=(end_time-mid_time)
        error[i]=np.max(np.abs(npcalc-sycalc))
    print("Max difference is", np.max(error), ", and mean difference is", np.mean(error))
    print("Average speed for numpy", timer_np/float(n))
    print("Average speed for sympy", timer_sy/float(n))
On my machine the result is:
In [21]: compare()
Max difference is 5.55111512313e-17 , and mean difference is 5.55111512313e-17
Average speed for numpy 0.00128133814335
Average speed for sympy 0.00127074036598
Any suggestions on how to make either of the above functions faster are welcome!

After further exploration, it seems that ufuncify and regular NumPy functions give more or less the same computation speed. Using numba or printing to a Theano function did not result in faster code. So the remaining options for making things faster are Cython or wrapping C or Fortran code.
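For calls this short, a single time.time() difference per iteration can be noisy, and the measurement above also includes the np.ravel copy. A minimal sketch, assuming only the standard-library timeit module and the nprhs function from the question (the helper name best_time is hypothetical), repeats each measurement and keeps the best run:

```python
import timeit
import numpy as np

p = {'a1': 0.5, 'a0': 1.5, 'e': 0.1}

def nprhs(u, v, p):
    # plain NumPy right-hand side, as in the question
    dudt = u - u**3 - v
    dvdt = p['e'] * (u - p['a1'] * v - p['a0'])
    return dudt, dvdt

def best_time(func, *args, number=100, repeat=5):
    # hypothetical helper: best average time per call over `repeat` runs
    runs = timeit.repeat(lambda: func(*args), number=number, repeat=repeat)
    return min(runs) / number

rng = np.random.default_rng(0)
u = rng.random((128, 128))
v = rng.random((128, 128))
t_np = best_time(nprhs, u, v, p)
print("NumPy: %.3e s per call" % t_np)
```

The ufuncify version can be timed the same way with best_time(syrhs, u, v).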

Related

Solving equation containing integrals with python

I'm currently trying to solve the following equation for x:
3.17e-2 - integral from x to 215 of [(10.^(8.64/x) / (480.1 - 10.^(4.32/x))^2)]dx = 0.
(sorry for writing the equation in such a crude way, I wasn't sure on how to insert latex on here)
so far I've come up with this:
import scipy as s
from scipy.integrate import odeint,quad
import numpy as np
def f(x):
    fpe = 40
    k = 1.26e4*fpe**2/4.2e4
    return 10.**(8.64/x) / (k - 10.**(4.32/x))**2
def intf(x):
    for i in x:
        if 3.17e-2 - quad(lambda i:f(i),i,215) == 0.:
            print(i)
xi = np.linspace(0.01, 5, 1000)
intf(xi)
However, I keep getting the following error:
OverflowError: (34, 'Result too large')
As you can imagine, this is not the result I was expecting. Do you reckon that this is only due to the result being too large or could there be something wrong with the code?
One thing you have to change: quad returns a tuple (y, abserr), so the value of the integral is quad(...)[0].
Also, if you test f(x) == 0 you will only detect exact solutions, which is practically impossible for this function in floating-point arithmetic. You could test abs(f(x)) < ytol instead, or simply use a root-finding method. I would suggest fsolve.
Another thing is that you know the derivative of the function, so you can pass it to fsolve as well. Putting it all together:
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve
def fprime(x):
    # the integrand; since x is the lower limit, it is also df/dx
    fpe = 40
    k = 1.26e4*fpe**2/4.2e4
    return 10.**(8.64/x) / (k - 10.**(4.32/x))**2
def f(x):
    try:
        # array input (as passed by fsolve): evaluate element by element
        return np.array([f(i) for i in x])
    except TypeError:
        # scalar input
        return 3.17e-2 - quad(lambda i:fprime(i),x,215)[0]
x0 = fsolve(f, 1, fprime=fprime)
this gives x0=2.03740802, and f(x0) = 2.35922393e-16
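fsolve needs a starting guess; when a sign change can be bracketed, a bracketing root finder is an alternative. The sketch below assumes scipy.optimize.brentq and a bracket [1.9, 5] chosen by inspection (both are assumptions, not part of the answer above). The bracket stays to the right of the point where the integrand's denominator vanishes (near x = 1.61).

```python
from scipy.integrate import quad
from scipy.optimize import brentq

def integrand(x):
    fpe = 40
    k = 1.26e4 * fpe**2 / 4.2e4
    return 10.0**(8.64 / x) / (k - 10.0**(4.32 / x))**2

def residual(x):
    # left-hand side of the equation; zero at the solution
    return 3.17e-2 - quad(integrand, x, 215)[0]

# residual is negative at 1.9 and positive at 5, so a root lies in between
root = brentq(residual, 1.9, 5.0)
print(root, residual(root))
```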

How to implement multiprocessing in Monte Carlo integration

I created a Python program that integrates a given function over a given interval using Monte Carlo simulation. It works well, except for the fact that it runs painfully slow when you want higher levels of accuracy (larger N value). I figured I'd give multiprocessing a try in order to speed it up, but then I realized I have no clue how to implement it. Here's what I have right now:
from scipy import random
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Process
import os
# GOAL: Approximate the integral of a function f(x) from lower bound a to upper bound b using Monte Carlo simulation
# bounds of integration
a = 0
b = np.pi
# function to integrate
def f(x):
    return np.sin(x)
N = 10000
areas = []
def mcIntegrate():
    for i in range(N):
        # array filled with random numbers between limits
        xrand = random.uniform(a, b, N)
        # sum the return values of the function of each random number
        integral = 0.0
        for i in range(N):
            integral += f(xrand[i])
        # scale integral by difference of bounds divided by amount of random values
        ans = integral * ((b - a) / float(N))
        # add approximation to list of other approximations
        areas.append(ans)
if __name__ == "__main__":
    processes = []
    numProcesses = os.cpu_count()
    for i in range(numProcesses):
        process = Process(target=mcIntegrate)
        processes.append(process)
    for process in processes:
        process.start()
    for process in processes:
        process.start()
    # graph approximation distribution
    plt.title("Distribution of Approximated Integrals")
    plt.hist(areas, bins=30, ec='black')
    plt.xlabel("Areas")
    plt.show()
Can I get some help with this implementation?
I took the advice from the comments and used multiprocessing.Pool, and also cut down on some operations by using NumPy instead. The run time went from about 5 min to about 6 s (for N = 10000). Here's my implementation:
import numpy as np
import matplotlib.pyplot as plt
import multiprocessing
import os
# GOAL: Approximate the integral of function f from lower bound a to upper bound b using Monte Carlo simulation
a = 0 # lower bound of integration
b = np.pi # upper bound of integration
f = np.sin # function to integrate
N = 10000 # sample size
def mcIntegrate(p):
    # p is the task index supplied by pool.map; it is not used
    # np.random is used directly: scipy.random was only an alias for it and may be missing in newer SciPy
    xrand = np.random.uniform(a, b, N) # create array filled with random numbers within bounds
    integral = np.sum(f(xrand)) # sum return values of function of each random number
    approx = integral * ((b - a) / float(N)) # scale integral by difference of bounds divided by sample size
    return approx
if __name__ == "__main__":
    # run simulation N times in parallel and store results in array
    with multiprocessing.Pool(os.cpu_count()) as pool:
        areas = pool.map(mcIntegrate, range(N))
    # graph approximation distribution
    plt.title("Distribution of Approximated Integrals")
    plt.hist(areas, bins=30, ec='black')
    plt.xlabel("Areas")
    plt.show()
This turned out to be a more interesting problem than I thought it would when I got to optimising it. The basic method is very simple:
from multiprocessing import Pool
def f(x):
    return x
if __name__ == "__main__":
    with Pool() as pool:
        results = pool.map(f, range(100))
Here is your mcIntegrate adapted for multiprocessing:
from tqdm import tqdm
def mcIntegrate(steps):
    tasks = []
    print("Setting up simulations")
    # linear
    for _ in tqdm(range(steps)):
        xrand = random.uniform(a, b, steps)
        for i in range(steps):
            tasks.append(xrand[i])
    pool = Pool(cpu_count())
    print("Simulating (no progress)")
    results = pool.map(f, tasks)
    pool.close()
    print("summing")
    areas = []
    for chunk in tqdm(range(steps)):
        vals = results[chunk * steps : (chunk + 1) * steps]
        integral = sum(vals)
        ans = integral * ((b - a) / float(steps))
        areas.append(ans)
    return areas
tqdm is just used to display a progress bar.
This is the basic workflow for multiprocessing: break the question up into tasks, solve all the tasks, then add them all back together again. And indeed the code as given works. (Note that I've renamed your N to steps.)
For completeness, the script now begins:
from numpy import random  # scipy.random was only an alias for numpy.random
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool, cpu_count
from tqdm import tqdm
# function to integrate
def f(x):
    return np.sin(x)
and ends
# a and b must exist before mcIntegrate is called, because it reads them as globals
a = 0
b = np.pi
areas = mcIntegrate(3_000)
plt.title("Distribution of Approximated Integrals")
plt.hist(areas, bins=30, ec="black")
plt.xlabel("Areas")
plt.show()
Optimisation
I deliberately split the problem up at the smallest possible level. Was this a good idea? To answer that, consider: how might we optimise the linear process of generating the tasks? This does take a considerable while at the moment. We could parallelise it:
def _prepare(steps):
    xrand = random.uniform(a, b, steps)
    return [xrand[i] for i in range(steps)]
def mcIntegrate(steps):
    ...
    tasks = []
    for res in tqdm(pool.imap(_prepare, (steps for _ in range(steps))), total=steps):
        tasks += res # slower except for very large steps
Here I've used pool.imap, which returns an iterator that yields results as soon as they are available, allowing us to build a progress bar. If you do this and compare, you will see that it runs slower than the linear solution. Removing the progress bar (on my machine) and replacing it with:
import time
start = time.perf_counter()
results = pool.map(_prepare, (steps for _ in range(steps)))
tasks = []
for res in results:
    tasks += res
print(time.perf_counter() - start)
is only marginally faster: it's still slower than running linearly. Serialising data to a process and then deserialising it has an overhead. If you try to get a progress bar on the whole thing, it becomes excruciatingly slow:
results = []
for result in tqdm(pool.imap(f, tasks), total=len(tasks)):
    results.append(result)
So what about iterating at a higher level? Here's another adaptation of your mcIntegrate:
a = 0
b = np.pi
def _mcIntegrate(steps):
    xrand = random.uniform(a, b, steps)
    integral = 0.0
    for i in range(steps):
        integral += f(xrand[i])
    ans = integral * ((b - a) / float(steps))
    return ans
def mcIntegrate(steps):
    areas = []
    p = Pool(cpu_count())
    for ans in tqdm(p.imap(_mcIntegrate, ((steps) for _ in range(steps))), total=steps):
        areas.append(ans)
    return areas
This, on my machine, is considerably faster. It's also considerably simpler. I was expecting a difference, but not such a considerable difference.
Takeaways
Multiprocessing isn't free. Something as simple as np.sin() is too cheap to multiprocess: we pay to serialise, deserialise, append, and so on, all for one sin() calculation. But if you do too many calculations per task, you will waste time as you lose granularity. Here the effect is more striking than I was expecting. The only way to know the right level of granularity for a particular problem is to profile and try.
My experience is that multiprocessing is often not very efficient (a ton of overhead). The more you push your code into numpy the faster it will be, with one caveat; you can overload your memory if you're not careful (10k x 10k is getting large). Lastly, it looks like N is doing double duty, both defining sample size for each estimate, and also serving as the number of trial estimates.
Here is how I would do this (with minor style changes):
import numpy as np
f = np.sin
a = 0
b = np.pi
# number samples for each trial, trial count, and number calculated at once
N = 10000
TRIALS = 10000
BATCH_SIZE=1000
def mc_integrate(f, a, b, N, batch_size=BATCH_SIZE):
    # compute everything carrying `batch_size` copies by extending the array dimension.
    # samples.shape == (N, batch_size)
    samples = np.random.uniform(a, b, size=(N, batch_size))
    integrals = np.sum(f(samples), axis=0)
    mc_estimates = integrals * ((b - a) / N)
    return mc_estimates
# loop over batch values to get final result
n, r = divmod(TRIALS, BATCH_SIZE)
results = []
for j in [BATCH_SIZE]*n + [r]:
    results.extend(mc_integrate(f, a, b, N, batch_size=j))
On my machine this takes a few seconds.
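How much accuracy a larger N buys can be estimated before running anything: for independent uniform samples, the standard deviation of one estimate is (b - a) * std(f(X)) / sqrt(N), so halving the error costs four times the samples. The sketch below checks this for f = sin on [0, pi]; the seed, the trial count and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed for reproducibility
a, b = 0.0, np.pi
N = 10_000      # samples per estimate
TRIALS = 200    # number of independent estimates

samples = rng.uniform(a, b, size=(N, TRIALS))
estimates = (b - a) * np.sin(samples).mean(axis=0)

# predicted spread of one estimate: (b - a) * std(f(X)) / sqrt(N),
# with the standard deviation of f(X) estimated from the samples themselves
predicted_std = (b - a) * np.sin(samples).std() / np.sqrt(N)

print("mean estimate:", estimates.mean())   # the exact integral is 2
print("observed std :", estimates.std(ddof=1))
print("predicted std:", predicted_std)
```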

Optimize a distance matrix calculation

I'm trying to calculate a distance matrix from the first two components of a Fourier transform. The matrix is 40k by 40k and the way I'm doing it is extremely slow. Is there a more efficient way to calculate the matrix?
import numpy as np
from scipy.linalg import dft
#Transform the data using Fourier Transform.
ft = norm_data.dot(dft(8).transpose())/np.sqrt(8)
def ft_distance_calc(x,y):
    temp = np.zeros((x,y))
    for i in range(x):
        for z in range(y):
            temp[i,z] = np.sqrt(np.square(abs(ft[i,0:2] - ft[z,0:2])).sum())
    return temp
ft_distance = ft_distance_calc(40000,40000)
You can use built-in functions for it:
from scipy.spatial.distance import cdist
def ft_distance_calc_2(x,y):
    return cdist(ft[:x,0:2],ft[:y,0:2])
Comparison using benchit:
#OP's solution
def ft_distance_calc(x,y):
    temp = np.zeros((x,y))
    for i in range(x):
        for z in range(y):
            temp[i,z] = np.sqrt(np.square(abs(ft[i,0:2] - ft[z,0:2])).sum())
    return temp
##Ehsan's solution
def ft_distance_calc_2(x,y):
    return cdist(ft[:x,0:2],ft[:y,0:2])
##Quang's solution
def dist_cal(x,y):
    return np.sqrt(np.square(ft[:x,None, :2]-ft[None, :y, :2]).sum(-1))
ft = np.random.rand(1000,2)
in_ = {n:[n, n] for n in [10,100,1000]}
Seems like ft_distance_calc_2 is the fastest.
How about broadcasting?
def dist_cal(x,y):
    return np.sqrt(np.square(ft[:x,None, :2]-ft[None, :y, :2]).sum(-1))
# test
a = ft_distance_calc(400,200)
b = dist_cal(400,200)
(np.abs(a-b) < 1e-6).all()
# True
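At the full problem size, memory becomes the limit rather than speed: a 40000 x 40000 array of float64 takes about 12.8 GB. A minimal sketch, assuming real-valued data (as in the benchmark above) and assuming the result can be consumed one block of rows at a time, keeps only one block in memory. The function name iter_distance_blocks, the block size and the nearest-neighbour reduction are hypothetical choices for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

def iter_distance_blocks(points, block_rows=1000):
    # hypothetical helper: yield (start, block), where block holds the
    # distances from points[start:start + block_rows] to all points
    for start in range(0, len(points), block_rows):
        yield start, cdist(points[start:start + block_rows], points)

rng = np.random.default_rng(0)
ft = rng.random((2500, 2))     # stand-in for the first two components

row_min = np.empty(len(ft))
for start, block in iter_distance_blocks(ft, block_rows=1000):
    # example reduction: distance to the nearest other point
    rows = np.arange(block.shape[0])
    block[rows, start + rows] = np.inf   # ignore each point's distance to itself
    row_min[start:start + block.shape[0]] = block.min(axis=1)
```

Any per-row reduction (nearest neighbour, thresholded counts, writing each block to disk) fits the same loop.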

Python - Using a Kronecker Delta with ODEINT

I'm trying to plot the output from an ODE using a Kronecker delta function which should only become 'active' at a specific time = t1.
This should give a sawtooth like response where the initial value decays down exponentially until t=t1 where it rises again instantly before decaying down once again.
However, when I plot this it looks like the solver is seeing the Kronecker delta function as zero for all time t. Is there any way to do this in Python?
from scipy import KroneckerDelta
import scipy.integrate as sp
import matplotlib.pyplot as plt
import numpy as np
def dy_dt(y,t):
    dy_dt = 500*KroneckerDelta(t,t1) - 2*y
    return dy_dt
t1 = 4
y0 = 500
t = np.arange(0,10,0.1)
y = sp.odeint(dy_dt,y0,t)
plt.plot(t,y)
In the case of a simple Kronecker delta using time, you can run the ode in pieces like so:
from scipy.integrate import odeint
import matplotlib.pyplot as plt
import numpy as np
def dy_dt(y,t):
    return -2*y
t_delta = 4
tend = 10
y0 = [500]
t1 = np.linspace(0,t_delta,50)
y1 = odeint(dy_dt,y0,t1)
y0 = y1[-1] + 500 # execute Kronecker delta
t2 = np.linspace(t_delta,tend,50)
y2 = odeint(dy_dt,y0,t2)
t = np.append(t1, t2)
y = np.append(y1, y2)
plt.plot(t,y)
Another option for complicated situations is to use the events functionality of solve_ivp.
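A minimal sketch of the events idea, assuming a terminal event at the known impulse time followed by a restart from the jumped state (the names T_KICK, KICK and reach_kick are illustrative). Note that solve_ivp expects the right-hand side as f(t, y), the reverse of odeint's argument order.

```python
import numpy as np
from scipy.integrate import solve_ivp

T_KICK = 4.0    # time of the impulse
KICK = 500.0    # size of the jump in y

def rhs(t, y):
    return -2.0 * y

def reach_kick(t, y):
    # event function: crosses zero when t reaches T_KICK
    return t - T_KICK
reach_kick.terminal = True   # stop the integration at the event

# integrate up to the event, apply the jump, then continue to the end
first = solve_ivp(rhs, (0.0, 10.0), [500.0], events=reach_kick,
                  rtol=1e-8, atol=1e-10)
y_after = first.y[:, -1] + KICK
second = solve_ivp(rhs, (first.t[-1], 10.0), y_after,
                   rtol=1e-8, atol=1e-10)

t = np.concatenate([first.t, second.t])
y = np.concatenate([first.y[0], second.y[0]])
```

For a fixed impulse time this does the same job as running odeint in two pieces; events become more useful when the trigger depends on the state y rather than on t alone.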
I think the problem could be internal rounding errors, because 0.1 cannot be represented exactly as a python float. I would try
import math
def dy_dt(y,t):
    if math.isclose(t, t1):
        return 500 - 2*y
    else:
        return -2*y
Also the documentation of odeint suggests using the args parameter instead of global variables to give your derivative function access to additional arguments and replacing np.arange by np.linspace:
import scipy.integrate as sp
import matplotlib.pyplot as plt
import numpy as np
import math
def dy_dt(y, t, t1):
    if math.isclose(t, t1):
        return 500 - 2*y
    else:
        return -2*y
t1 = 4
y0 = 500
t = np.linspace(0, 10, num=101)
y = sp.odeint(dy_dt, y0, t, args=(t1,))
plt.plot(t, y)
I did not test the code so tell me if there is anything wrong with it.
EDIT:
When testing my code I took a look at the t values for which dy_dt is evaluated. I noticed that odeint does not only use the t values that were specified, but alters them slightly:
...
3.6636447422787928
3.743098503914526
3.822552265550259
3.902006027185992
3.991829287543431
4.08165254790087
4.171475808258308
...
Now using my method, we get
math.isclose(3.991829287543431, 4) # False
because the default tolerance is set to a relative error of at most 10^(-9), so the odeint function "misses" the bump of the derivative at 4. Luckily, we can fix that by specifying a higher error threshold:
def dy_dt(y, t, t1):
    if math.isclose(t, t1, abs_tol=0.01):
        return 500 - 2*y
    else:
        return -2*y
Now dy_dt is very high for all values between 3.99 and 4.01. It is possible to make this range smaller if the num argument of linspace is increased.
TL;DR
Your problem is not a Python problem but a problem of numerically solving a differential equation: you need to alter your derivative over an interval of sufficient length, otherwise the solver will likely miss the interesting spot. A Kronecker delta does not work with numerical approaches to solving ODEs.
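One way to make "an interval of sufficient length" concrete is to replace the delta with a narrow rectangular pulse whose area equals the desired jump, and to cap the solver's step size below the pulse width. The sketch below assumes solve_ivp (argument order f(t, y)) rather than odeint; the width and the constant names are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

T1 = 4.0        # centre of the pulse
WIDTH = 0.01    # pulse width (assumed; narrower means closer to an impulse)
AREA = 500.0    # total jump the pulse should deliver

def dy_dt(t, y):
    # rectangular pulse of height AREA / WIDTH, so its integral is AREA
    pulse = AREA / WIDTH if abs(t - T1) <= WIDTH / 2 else 0.0
    return pulse - 2.0 * y

# max_step below the pulse width guarantees the solver samples inside it
sol = solve_ivp(dy_dt, (0.0, 10.0), [500.0], max_step=WIDTH / 4)

before = sol.y[0][sol.t <= T1 - WIDTH / 2][-1]   # last value before the pulse
after = sol.y[0][sol.t >= T1 + WIDTH / 2][0]     # first value after the pulse
print(before, after)
```

The value after the pulse falls slightly short of 500 because y keeps decaying while the pulse is active; shrinking WIDTH (and max_step with it) narrows that gap at the cost of more steps.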

Not able to automatically parallelize for loop with numba

I am trying to run the following on multiple cores for speed up using numba. Unfortunately the function seems to run only on one core when I tested it. Can someone explain to me why and if there is a possibility to get it running on multiple cores?
Minimal working example:
import numpy as np
import numba
a = np.random.rand(100000)
@numba.jit(nopython=True, parallel=True)
def func(x):
    result = np.zeros_like(x)
    for delta in range(1,len(x)):
        thisresult = 0
        for i in range(delta,len(x)):
            thisresult += (x[i] - x[i-delta])**2
        result[delta] = thisresult / (len(x) - delta)
    return result
print(func(a))
Explicit Parallelization
I would always recommend parallelizing code explicitly. Numba tries to parallelize some parts of the code automatically, but that won't always work or lead to the best performance.
import numpy as np
import numba
a = np.random.rand(100000)
@numba.jit(nopython=True, parallel=True)
def func(x):
    result = np.zeros_like(x,dtype=x.dtype)
    # prange marks the outer loop as safe to distribute across threads
    for delta in numba.prange(1,len(x)):
        thisresult = 0
        for i in range(delta,len(x)):
            thisresult += (x[i] - x[i-delta])**2
        result[delta] = thisresult / (len(x) - delta)
    return result
print(func(a))
For more details have a look at the documentation.
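Independently of Numba, the inner loop over i is a mean of squared differences between the array and a shifted copy of itself, so it can be written with NumPy slicing. This is a different technique from the prange version above, sketched here under the assumption that only the inner loop is vectorized (the function name func_sliced is hypothetical):

```python
import numpy as np

def func_sliced(x):
    result = np.zeros_like(x)
    for delta in range(1, len(x)):
        # x[delta:] - x[:-delta] holds x[i] - x[i - delta] for i = delta .. len(x) - 1
        diff = x[delta:] - x[:-delta]
        result[delta] = np.mean(diff**2)
    return result

x = np.random.default_rng(0).random(1000)
print(func_sliced(x)[:5])
```

The total work is still quadratic in len(x), so for very long arrays the compiled, parallel version remains relevant.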
