Following my former question [1], I would like to apply multiprocessing to matplotlib's griddata function. Is it possible to split the griddata into, say 4 parts, one for each of my 4 cores? I need this to improve performance.
For example, try the code below, experimenting with different values for size:
import numpy as np
import matplotlib.mlab as mlab
import time
size = 500
Y = np.arange(size)
X = np.arange(size)
x, y = np.meshgrid(X, Y)
u = x * np.sin(5) + y * np.cos(5)
v = x * np.cos(5) + y * np.sin(5)
test = x + y
tic = time.clock()
test_d = mlab.griddata(
x.flatten(), y.flatten(), test.flatten(), x+u, y+v, interp='linear')
toc = time.clock()
print 'Time=', toc-tic
I ran the example code below in Python 3.4.2, with numpy version 1.9.1 and matplotlib version 1.4.2, on a Macbook Pro with 4 physical CPUs (i.e., as opposed to "virtual" CPUs, which the Mac hardware architecture also makes available for some use cases):
import numpy as np
import matplotlib.mlab as mlab
import time
import multiprocessing
# This value should be set much larger than nprocs, defined later below
size = 500
Y = np.arange(size)
X = np.arange(size)
x, y = np.meshgrid(X, Y)
u = x * np.sin(5) + y * np.cos(5)
v = x * np.cos(5) + y * np.sin(5)
test = x + y
tic = time.clock()
test_d = mlab.griddata(
x.flatten(), y.flatten(), test.flatten(), x+u, y+v, interp='linear')
toc = time.clock()
print('Single Processor Time={0}'.format(toc-tic))
# Put interpolation points into a single array so that we can slice it easily
xi = x + u
yi = y + v
# My example test machine has 4 physical CPUs
nprocs = 4
jump = int(size/nprocs)
# Enclose the griddata function in a wrapper which will communicate its
# output result back to the calling process via a Queue
def wrapper(x, y, z, xi, yi, q):
test_w = mlab.griddata(x, y, z, xi, yi, interp='linear')
q.put(test_w)
# Measure the elapsed time for multiprocessing separately
ticm = time.clock()
queue, process = [], []
for n in range(nprocs):
queue.append(multiprocessing.Queue())
# Handle the possibility that size is not evenly divisible by nprocs
if n == (nprocs-1):
finalidx = size
else:
finalidx = (n + 1) * jump
# Define the arguments, dividing the interpolation variables into
# nprocs roughly evenly sized slices
argtuple = (x.flatten(), y.flatten(), test.flatten(),
xi[:,(n*jump):finalidx], yi[:,(n*jump):finalidx], queue[-1])
# Create the processes, and launch them
process.append(multiprocessing.Process(target=wrapper, args=argtuple))
process[-1].start()
# Initialize an array to hold the return value, and make sure that it is
# null-valued but of the appropriate size
test_m = np.asarray([[] for s in range(size)])
# Read the individual results back from the queues and concatenate them
# into the return array
for q, p in zip(queue, process):
test_m = np.concatenate((test_m, q.get()), axis=1)
p.join()
tocm = time.clock()
print('Multiprocessing Time={0}'.format(tocm-ticm))
# Check that the result of both methods is actually the same; should raise
# an AssertionError exception if assertion is not True
assert np.all(test_d == test_m)
and I got the following result:
/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/matplotlib/tri/triangulation.py:110: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.self._neighbors)
Single Processor Time=8.495998
Multiprocessing Time=2.249938
I'm not really sure what is causing the "future warning" from triangulation.py (evidently my version of matplotlib did not like something about the input values that were originally provided for the question), but regardless, the multiprocessing does appear to achieve the desired speedup of 8.50/2.25 = 3.8, (edit: see comments) which is roughly in the neighborhood of about 4X that we would expect for a machine with 4 CPUs. And the assertion statement at the end also executes successfully, proving that the two methods get the same answer, so in spite of the slightly weird warning message, I believe that the code above is a valid solution.
EDIT: A commenter has pointed out that both my solution, as well as the code snippet posted by the original author, are likely using the wrong method, time.clock(), for measuring execution time; he suggests using time.time() instead. I think I'm also coming around to his point of view. (Digging into the Python documentation a bit further, I'm still not convinced that even this solution is 100% correct, as newer versions of Python appear to have deprecated time.clock() in favor of time.perf_counter() and time.process_time(). But regardless, I do agree that whether or not time.time() is absolutely the most correct way of taking this measurement, it's still probably more correct than what I had been using before, time.clock().)
Assuming the commenter's point is correct, then it means the approximately 4X speedup that I thought I had measured is in fact wrong.
However, that does not mean that the underlying code itself wasn't correctly parallelized; rather, it just means that parallelization didn't actually help in this case; splitting up the data and running on multiple processors didn't improve anything. Why would this be? Other users have pointed out that, at least in numpy/scipy, some functions run on multiple cores, and some do not, and it can be a seriously challenging research project for an end-user to try to figure out which ones are which.
Based on the results of this experiment, if my solution correctly achieves parallelization within Python, but no further speedup is observed, then I would suggest the simplest likely explanation is that matplotlib is probably also parallelizing some of its functions "under the hood", so to speak, in compiled C++ libraries, just like numpy/scipy already do. Assuming that's the case, then the correct answer to this question would be that nothing further can be done: further parallelizing in Python will do no good if the underlying C++ libraries are already silently running on multiple cores to begin with.
Related
I want to use the method of lines to solve the thin-film equation. I have implemented it (with gamma=mu=0) Matlab using ode15s and it seems to work fine:
N = 64;
x = linspace(-1,1,N+1);
x = x(1:end-1);
dx = x(2)-x(1);
T = 1e-2;
h0 = 1+0.1*cos(pi*x);
[t,h] = ode15s(#(t,y) thinFilmEq(t,y,dx), [0,T], h0);
function dhdt = thinFilmEq(t,h,dx)
phi = 0;
hxx = (circshift(h,1) - 2*h + circshift(h,-1))/dx^2;
p = phi - hxx;
px = (circshift(p,-1)-circshift(p,1))/dx;
flux = (h.^3).*px/3;
dhdt = (circshift(flux,-1) - circshift(flux,1))/dx;
end
The film just flattens after some time, and for large time the film should tend to h(t->inf)=1. I haven't done any rigorous check and convergence analysis, but at least the result looks promising after only spending less than 5 mins to code it.
I want to do the same thing in Python, and I tried the following:
import numpy as np
import scipy.integrate as spi
def thin_film_eq(t,h,dx):
print(t) # to check the current evaluation time for debugging
phi = 0
hxx = (np.roll(h,1) - 2*h + np.roll(h,-1))/dx**2
p = phi - hxx
px = (np.roll(p,-1) - np.roll(p,1))/dx
flux = h**3*px/3
dhdt = (np.roll(flux,-1) - np.roll(flux,1))/dx
return dhdt
N = 64
x = np.linspace(-1,1,N+1)[:-1]
dx = x[1]-x[0]
T = 1e-2
h0 = 1 + 0.1*np.cos(np.pi*x)
sol = spi.solve_ivp(lambda t,h: thin_film_eq(t,h,dx), (0,T), h0, method='BDF', vectorized=True)
I add a print statement inside the function so I can check the current progress of the program. For some reasons, it is taking very tiny time step and after waiting for a few minutes it is still stuck at t=3.465e-5, with dt smaller than 1e-10. (haven't finished yet by the time I finished typing this question, and it probably won't within any reasonable time). For the Matlab program, it is done within a second with only 14 time steps taken (I only specify the time span, and it outputs 14 time steps with everything else kept at default). I want to ask the following:
Have I done anything wrong which dramatically slows down the computation time for my Python code? What settings should I choose for the solve_ivp function call? One thing I'm not sure is if I do the vectorization properly. Also did I write the function in the correct way? I know this is a stiff ODE, but the ultra-small time step taken by
Is the difference really just down to the difference in the ode solver? scipy.integrate.solve_ivp(f, method='BDF') is the recommended substitute of ode15s according to the official numpy website. But for this particular example the performance difference is one second vs takes ages to solve. The difference is a lot bigger than I thought.
Are there other alternative methods I can try in Python for solving similar PDEs? (something along the line of finite difference/method of lines) I mean utilizing existing libraries, preferably those in scipy.
My python code takes about 6.2 seconds to run. The Matlab code runs in under 0.05 seconds. Why is this and what can I do to speed up the Python code? Is Cython the solution?
Matlab:
function X=Test
nIter=1000000;
Step=.001;
X0=1;
X=zeros(1,nIter+1); X(1)=X0;
tic
for i=1:nIter
X(i+1)=X(i)+Step*(X(i)^2*cos(i*Step+X(i)));
end
toc
figure(1) plot(0:nIter,X)
Python:
nIter = 1000000
Step = .001
x = np.zeros(1+nIter)
x[0] = 1
start = time.time()
for i in range(1,1+nIter):
x[i] = x[i-1] + Step*x[i-1]**2*np.cos(Step*(i-1)+x[i-1])
end = time.time()
print(end - start)
How to speed up your Python code
Your largest time sink is np.cos which performs several checks on the format of the input.
These are relevant and usually negligible for high-dimensional inputs, but for your one-dimensional input, this becomes the bottleneck.
The solution to this is to use math.cos, which only accepts one-dimensional numbers as input and thus is faster (though less flexible).
Another time sink is indexing x multiple times.
You can speed this up by having one state variable which you update and only writing to x once per iteration.
With all of this, you can speed up things by a factor of roughly ten:
import numpy as np
from math import cos
nIter = 1000000
Step = .001
x = np.zeros(1+nIter)
state = x[0] = 1
for i in range(nIter):
state += Step*state**2*cos(Step*i+state)
x[i+1] = state
Now, your main problem is that your truly innermost loop happens completely in Python, i.e., you have a lot of wrapping operations that eat up time.
You can avoid this by using uFuncs (e.g., created with SymPy’s ufuncify) and using NumPy’s accumulate:
import numpy as np
from sympy.utilities.autowrap import ufuncify
from sympy.abc import t,y
from sympy import cos
nIter = 1000000
Step = 0.001
state = x[0] = 1
f = ufuncify([y,t],y+Step*y**2*cos(t+y))
times = np.arange(0,nIter*Step,Step)
times[0] = 1
x = f.accumulate(times)
This runs practically within an instant.
… and why that’s not what you should worry about
If your exact code (and only that) is what you care about, then you shouldn’t worry about runtime anyway, because it’s very short either way.
If on the other hand, you use this to gauge efficiency for problems with a considerable runtime, your example will fail because it considers only one initial condition and is a very simple dynamics.
Moreover, you are using the Euler method, which is either not very efficient or robust, depending on your step size.
The latter (Step) is absurdly low in your case, yielding much more data than you probably need:
With a step size of 1, You can see what’s going on just fine.
If you want a robust integration in such cases, it’s almost always best to use a modern adaptive integrator, that can adjust its step size itself, e.g., here is a solution to your problem using a native Python integrator:
from math import cos
import numpy as np
from scipy.integrate import solve_ivp
T = 1000
dt = 0.001
x = solve_ivp(
lambda t,state: state**2*cos(t+state),
t_span = (0,T),
t_eval = np.arange(0,T,dt),
y0 = [1],
rtol = 1e-5
).y
This automatically adjusts the step size to something higher, depending on the error tolerance rtol.
It still returns the same amount of output data, but that’s via interpolation of the solution.
It runs in 0.3 s for me.
How to speed up things in a scalable manner
If you still need to speed up something like this, chances are that your derivative (f) is considerably more complex than in your example and thus it is the bottleneck.
Depending on your problem, you may be able to vectorise its calcultion (using NumPy or similar).
If you can’t vectorise, I wrote a module that specifically focusses on this by hard-coding your derivative under the hood.
Here is your example in with a sampling step of 1.
import numpy as np
from jitcode import jitcode,y,t
from symengine import cos
T = 1000
dt = 1
ODE = jitcode([y(0)**2*cos(t+y(0))])
ODE.set_initial_value([1])
ODE.set_integrator("dop853")
x = np.hstack([ODE.integrate(t) for t in np.arange(0,T,dt)])
This runs again within an instant. While this may not be a relevant speed boost here, this is scalable to huge systems.
The difference is jit-compilation, which Matlab uses per default. Let's try your example with Numba(a Python jit-compiler)
Code
import numba as nb
import numpy as np
import time
nIter = 1000000
Step = .001
#nb.njit()
def integrate(nIter,Step):
x = np.zeros(1+nIter)
x[0] = 1
for i in range(1,1+nIter):
x[i] = x[i-1] + Step*x[i-1]**2*np.cos(Step*(i-1)+x[i-1])
return x
#Avoid measuring the compilation time,
#this would be also recommendable for Matlab to have a fair comparison
res=integrate(nIter,Step)
start = time.time()
for i in range(100):
res=integrate(nIter,Step)
end=time.time()
print((end - start)/100)
This results in 0.022s runtime per call.
I have two small issues regarding my code in Python 2.7, and I would like to see your input on these issues:
Lesser Issue: So I've been testing a Python 2.7 script designed to output a degree heading after given a distance traveled around an ellipse. However, I need the program to run within 15 seconds to the full extent. Currently, the program outputs 31.1410000324 seconds for completion which isn't terrible considering that it would be possible for me to divide the sample size into a little less than half.
Main Issue: However, here is the other problem: at 3 specific points in the program, the result being output is not correct. One time it outputs 18.47700786, and the other two output 37.90802974. Why exactly is this happening? When I run all 3 of them normally, they all output the correct values, but at full speed, they appear to give strange values even though the values are consistent every time. I would like to know what you think.
Here is the script:
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize
import time
a = 2; b = 1
Dx = lambda t: -a * np.sin(t)
Dy = lambda t: b * np.cos(t)
encoderdistance = 0.00001
def func(x): return np.sqrt(Dx(x)**2 + Dy(x)**2)
time1 = time.time()
for i in range(2423):
r_angle = (((180/np.pi) * np.arctan(((np.tan(minimize(lambda x:
abs(quad(func, 0, x)[0] - encoderdistance), 1).x))*b)/a)))
print r_angle[0]
encoderdistance += 0.001
time2 = time.time()
print time2-time1, "seconds for completion"
And here is a graph of the output code:
I would like to use scipy.integrate.ode (or scipy.integrate.odeint) instances in multiple threads (one for each CPU core) in order to solve multiple IVPs at a time. However the documentation says: "This integrator is not re-entrant. You cannot have two ode instances using the “vode” integrator at the same time."
(Also odeint causes internal errors if instantiated multiple times although the documentation does not say so.)
Any idea what can be done?
One option is to use multiprocessing (i.e. use processes instead of threads). Here's an example that uses the map function of the multiprocessing.Pool class.
The function solve takes a set of initial conditions and returns a solution generated by odeint. The "serial" version of the code in the main section calls solve repeatedly, once for each set of initial conditions in ics. The "multiprocessing" version uses the map function of a multiprocessing.Pool instance to run several processes simultaneously, each calling solve. The map function takes care of doling out the arguments to solve.
My computer has four cores, and as I increase num_processes, the speedup maxes out at about 3.6.
from __future__ import division, print_function
import sys
import time
import multiprocessing as mp
import numpy as np
from scipy.integrate import odeint
def lorenz(q, t, sigma, rho, beta):
x, y, z = q
return [sigma*(y - x), x*(rho - z) - y, x*y - beta*z]
def solve(ic):
t = np.linspace(0, 200, 801)
sigma = 10.0
rho = 28.0
beta = 8/3
sol = odeint(lorenz, ic, t, args=(sigma, rho, beta), rtol=1e-10, atol=1e-12)
return sol
if __name__ == "__main__":
ics = np.random.randn(100, 3)
print("multiprocessing:", end='')
tstart = time.time()
num_processes = 5
p = mp.Pool(num_processes)
mp_solutions = p.map(solve, ics)
tend = time.time()
tmp = tend - tstart
print(" %8.3f seconds" % tmp)
print("serial: ", end='')
sys.stdout.flush()
tstart = time.time()
serial_solutions = [solve(ic) for ic in ics]
tend = time.time()
tserial = tend - tstart
print(" %8.3f seconds" % tserial)
print("num_processes = %i, speedup = %.2f" % (num_processes, tserial/tmp))
check = [(sol1 == sol2).all()
for sol1, sol2 in zip(serial_solutions, mp_solutions)]
if not all(check):
print("There was at least one discrepancy in the solutions.")
On my computer, the output is:
multiprocessing: 6.904 seconds
serial: 24.756 seconds
num_processes = 5, speedup = 3.59
SciPy.integrate.ode appears to use the LLNL SUNDIALS solvers, although SciPy doesn't say so explicitly, but they should, in my opinion.
The current version of the CVODE ode solver, 3.2.2, is re-entrant, which means that it can be used to solve multiple problems concurrently. The relevant information appears in User Documentation for CVODE v3.2.0 (SUNDIALS v3.2.0).
All state information used by cvode to solve a given problem is saved in a structure, and a pointer
to that structure is returned to the user. There is no global data in the cvode package, and so, in this
respect, it is reentrant. State information specific to the linear solver is saved in a separate structure,
a pointer to which resides in the cvode memory structure. The reentrancy of cvode was motivated
by the anticipated multicomputer extension, but is also essential in a uniprocessor setting where two
or more problems are solved by intermixed calls to the package from within a single user program.
But I don't know whether SciPy.integrate.ode, or other ode solvers like scikits.odes.ode, support this concurrency.
I'm studying dynamical systems, particularly the logistic family g(x) = cx(1-x), and I need to iterate this function an arbitrary amount of times to understand its behavior. I have no problem iterating the function given a specific point x_0, but again, I'd like to graph the entire function and its iterations, not just a single point. For plotting a single function, I have this code:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
def logplot(c, n = 10):
dt = .001
x = np.arange(0,1.001,dt)
y = c*x*(1-x)
plt.plot(x,y)
plt.axis([0, 1, 0, c*.25 + (1/10)*c*.25])
plt.show()
I suppose I could tackle this by the lengthy/daunting method of explicitly creating a list of the range of each iteration using something like the following:
def log(c,x0):
return c*x0*(1-x)
def logiter(c,x0,n):
i = 0
y = []
while i <= n:
val = log(c,x0)
y.append(val)
x0 = val
i += 1
return y
But this seems really cumbersome and I was wondering if there were a better way. Thanks
Some different options
This is really a matter of style. Your solution works and is not very difficult to understand. If you want to go on on those lines, then I would just tweak it a bit:
def logiter(c, x0, n):
y = []
x = x0
for i in range(n):
x = c*x*(1-x)
y.append(x)
return np.array(y)
The changes:
for loop is easier to read than a while loop
x0 is not used in the iteration (this adds one more variable, but it is mathematically easier to understand; x0 is a constant)
the function is written out, as it is a very simple one-liner (if it weren't, its name should be changed to be something else than log, which is very easy to confuse with logarithm)
the result is converted into a numpy array. (Just what I usually do, if I need to plot something)
In my opinion the function is now legible enough.
You might also take an object-oriented approach and create a logistic function object:
class Logistics():
def __init__(self, c, x0):
self.x = x0
self.c = c
def next_iter(self):
self.x = self.c * self.x * (1 - self.x)
return self.x
Then you may use this:
def logiter(c, x0, n):
l = Logistics(c, x0)
return np.array([ l.next_iter() for i in range(n) ])
Or if you may make it a generator:
def log_generator(c, x0):
x = x0
while True:
x = c * x * (1-x)
yield x
def logiter(c, x0, n):
l = log_generator(c, x0)
return np.array([ l.next() for i in range(n) ])
If you need performance and have large tables, then I suggest:
def logiter(c, x0, n):
res = np.empty((n, len(x0)))
res[0] = c * x0 * (1 - x0)
for i in range(1,n):
res[i] = c * res[i-1] * (1 - res[i-1])
return res
This avoids the slowish conversion into np.array and some copying of stuff around. The memory is allocated only once, and the expensive conversion from a list into an array is avoided.
(BTW, if you returned an array with the initial x0 as the first row, the last version would look cleaner. Now the first one has to be calculated separately if copying the vector around is desired to be avoided.)
Which one is best? I do not know. IMO, all are readable and justified, it is a matter of style. However, I speak only very broken and poor Pythonic, so there may be good reasons why still something else is better or why something of the above is not good!
Performance
About performance: With my machine I tried the following:
logiter(3.2, linspace(0,1,1000), 10000)
For the first three approaches the time is essentially the same, approximately 1.5 s. For the last approach (preallocated array) the run time is 0.2 s. However, if the conversion from a list into an array is removed, the first one runs in 0.16 s, so the time is really spent in the conversion procedure.
Visualization
I can think of two useful but quite different ways to visualize the function. You mention that you will have, say, 100 or 1000 different x0's to start with. You do not mention how many iterations you want to have, but maybe we will start with just 100. So, let us create an array with 100 different x0's and 100 iterations at c = 3.2.
data = logiter(3.6, np.linspace(0,1,100), 100)
In a way a standard method to visualize the function is draw 100 lines, each of which represents one starting value. That is easy:
import matplotlib.pyplot as plt
plt.plot(data)
plt.show()
This gives:
Well, it seems that all values end up oscillating somewhere, but other than that we have only a mess of color. This approach may be more useful, if you use a narrower range of values for x0:
data = logiter(3.6, np.linspace(0.8,0.81,100), 100)
you may color-code the starting values by e.g.:
color1 = np.array([1,0,0])
color2 = np.array([0,0,1])
for i,k in enumerate(np.linspace(0, 1, data.shape[1])):
plt.plot(data[:,i], '.', color=(1-k)*color1 + k*color2)
This plots the first columns (corresponding to x0 = 0.80) in red and the last columns in blue and uses a gradual color change in between. (Please note that the more blue a dot is, the later it is drawn, and thus blues overlap reds.)
However, it is possible to take a quite different approach.
data = logiter(3.6, np.linspace(0,1,1000), 50)
plt.imshow(data.T, cmap=plt.cm.bwr, interpolation='nearest', origin='lower',extent=[1,21,0,1], vmin=0, vmax=1)
plt.axis('tight')
plt.colorbar()
gives:
This is my personal favourite. I won't spoil anyone's joy by explaining it too much, but IMO this shows many peculiarities of the behaviour very easily.
Here's what I was aiming for; an indirect approach to understanding (by visualization) the behavior of initial conditions of the function g(c, x) = cx(1-x):
def jam(c, n):
x = np.linspace(0,1,100)
y = c*x*(1-x)
for i in range(n):
plt.plot(x, y)
y = c*y*(1-y)
plt.show()