I have a square matrix S (160 x 160), and a huge matrix X (160 x 250000). Both are dense numpy arrays.
My goal: find Q such that Q = inv(chol(S)) * X, where chol(S) is the lower cholesky factorization of S.
Naturally, a simple solution is
cholS = scipy.linalg.cholesky( S, lower=True)
scipy.linalg.solve( cholS, X )
My problem: this solution is noticeably slower (>2x) in python than when I try the same in Matlab. Here are some timing experiments:
timeit np.linalg.solve( cholS, X)
1 loops, best of 3: 1.63 s per loop
timeit scipy.linalg.solve_triangular( cholS, X, lower=True)
1 loops, best of 3: 2.19 s per loop
timeit scipy.linalg.solve( cholS, X)
1 loops, best of 3: 2.81 s per loop
[matlab]
cholS \ X
0.675 s
[matlab using only one thread via -singleCompThread]
cholS \ X
1.26 s
Basically, I'd like to know: (1) can I reach Matlab-like speeds in python? and (2) why is the scipy version so slow?
The solver should be able to take advantage of the fact that chol(S) is triangular. However, using numpy.linalg.solve() is faster than scipy.linalg.solve_triangular(), even though the numpy call doesn't use the triangular structure at all. What gives? The matlab solver seems to auto-detect when my matrix is triangular, but python cannot.
I'd be happy to use a custom call to BLAS/LAPACK routines for solving triangular linear systems, but I really don't want to write that code myself.
For reference, I'm using scipy version 11.0 and the Enthought python distribution (which uses Intel's MKL library for vectorization), so I think I should be able to reach Matlab-like speeds.
TL;DR: Don't use numpy's or scipy's solve when you have a triangular system, just use scipy.linalg.solve_triangular with at least the check_finite=False keyword argument for fast and non-destructive solutions.
I found this thread after stumbling across some discrepancies between numpy.linalg.solve and scipy.linalg.solve (and scipy's lu_solve, etc.). I don't have Enthought's MKL-based Numpy/Scipy, but I hope my findings can help you in some way.
With the pre-built binaries for Numpy and Scipy (32-bit, running on Windows 7):
I see a significant difference between numpy.linalg.solve and scipy.linalg.solve when solving for a vector X (i.e., X is 160 by 1). Scipy runtime is 1.23x numpy's, which is I think substantial.
However, most of the difference appears to be due to scipy's solve checking for invalid entries. When passing check_finite=False into scipy.linalg.solve, scipy's solve runtime is 1.02x numpy's.
Scipy's solve using destructive updates, i.e., with overwrite_a=True, overwrite_b=True is slightly faster than numpy's solve (which is non-destructive). Numpy's solve runtime is 1.021x destructive scipy.linalg.solve. Scipy with just check_finite=False has runtime 1.04x the destructive case. In summary, destructive scipy.linalg.solve is very slightly faster than either of these cases.
The above are for a vector X. If I make X a wide array, specifically 160 by 10000, scipy.linalg.solve with check_finite=False is essentially as fast as with check_finite=False, overwrite_a=True, overwrite_b=True. Scipy's solve (without any special keywords) runtime is 1.09x this "unsafe" (check_finite=False) call. Numpy's solve has runtime 1.03x scipy's fastest for this array X case.
scipy.linalg.solve_triangular delivers significant speedups in both these cases, but you have to turn off input checking, i.e., pass in check_finite=False. The runtime for the fastest solve was 5.68x and 1.76x solve_triangular's, for vector and array X, respectively, with check_finite=False.
solve_triangular with destructive computation (overwrite_b=True) gives you no speedup on top of check_finite=False (and actually hurts slightly for the array X case).
I, ignoramus, was previously unaware of solve_triangular and was using scipy.linalg.lu_solve as a triangular solver, i.e., instead of solve_triangular(cholS, X) doing lu_solve((cholS, numpy.arange(160)), X) (both produce the same answer). But I discovered that lu_solve used in this way has runtime 1.07x unsafe solve_triangular for the vector X case, while its runtime was 1.76x for the array X case. I'm not sure why lu_solve is so much slower for array X, compared to vector X, but the lesson is to use solve_triangular (without infinite checks).
Copying the data to Fortran format didn't seem to matter at all. Neither does converting to numpy.matrix.
I might as well compare my non-MKL Python libraries against single-threaded (maxNumCompThreads=1) Matlab 2013a. The fastest Python implementations above had 4.5x longer runtime for the vector X case and 6.3x longer runtime for the fat matrix X case.
However, here's the Python script I used to benchmark these, perhaps someone with MKL-accelerated Numpy/Scipy can post their numbers. Note that I just comment out the line n = 10000 to disable the fat matrix X case and do the n=1 vector case. (Sorry.)
import scipy.linalg as sla
import numpy.linalg as nla
from numpy.random import RandomState
from timeit import timeit
import numpy as np
RNG = RandomState(69)
m=160
n=1
#n=10000
Ac = RNG.randn(m,m)
if 1:
Ac = np.triu(Ac)
bc = RNG.randn(m,n)
Af = Ac.copy("F")
bf = bc.copy("F")
if 0: # Save to Matlab format
import scipy.io as io
io.savemat("b_%d.mat"%(n,), dict(A=Ac, b=bc))
import sys
sys.exit(0)
def lapper(fn, source, **kwargs):
Alocal = source[0].copy()
blocal = source[1].copy()
fn(Alocal, blocal,**kwargs)
laps = (1000 if n<=1 else 100)
def printer(t, s=''):
print ("%g seconds, %d laps, " % (t/float(laps), laps)) + s
return t/float(laps)
t=[]
print "C"
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc)), number=laps),
"scipy.solve"))
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc), check_finite=False),
number=laps), "scipy.solve, infinite-ok"))
t.append(printer(timeit(lambda: lapper(nla.solve, (Ac,bc)), number=laps),
"numpy.solve"))
#print "F" # Doesn't seem to matter
#printer(timeit(lambda: lapper(sla.solve, (Af,bf)), number=laps))
#printer(timeit(lambda: lapper(nla.solve, (Af,bf)), number=laps))
print "sla with tweaks"
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc), overwrite_a=True,
overwrite_b=True, check_finite=False),
number=laps), "scipy.solve destructive"))
print "Tri"
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc)),
number=laps), "scipy.solve_triangular"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc),
check_finite=False), number=laps),
"scipy.solve_triangular, inf-ok"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc),
overwrite_b=True, check_finite=False),
number=laps), "scipy.solve_triangular destructive"))
print "LU"
piv = np.arange(m)
t.append(printer(timeit(lambda: lapper(
lambda X,b: sla.lu_solve((X, piv),b,check_finite=False),
(Ac,bc)), number=laps), "LU"))
print "all times:"
print t
Output of the above script for the vector case, n=1:
C
0.000739405 seconds, 1000 laps, scipy.solve
0.000624746 seconds, 1000 laps, scipy.solve, infinite-ok
0.000590003 seconds, 1000 laps, numpy.solve
sla with tweaks
0.000608365 seconds, 1000 laps, scipy.solve destructive
Tri
0.000208711 seconds, 1000 laps, scipy.solve_triangular
9.38371e-05 seconds, 1000 laps, scipy.solve_triangular, inf-ok
9.37682e-05 seconds, 1000 laps, scipy.solve_triangular destructive
LU
0.000100215 seconds, 1000 laps, LU
all times:
[0.0007394047886284343, 0.00062474593940593, 0.0005900030818282472, 0.0006083650710913095, 0.00020871054023307778, 9.383710445114923e-05, 9.37682389063692e-05, 0.00010021534750467032]
Output of the above script for the matrix case n=10000:
C
0.118985 seconds, 100 laps, scipy.solve
0.113687 seconds, 100 laps, scipy.solve, infinite-ok
0.115569 seconds, 100 laps, numpy.solve
sla with tweaks
0.113122 seconds, 100 laps, scipy.solve destructive
Tri
0.0725959 seconds, 100 laps, scipy.solve_triangular
0.0634396 seconds, 100 laps, scipy.solve_triangular, inf-ok
0.0638423 seconds, 100 laps, scipy.solve_triangular destructive
LU
0.1115 seconds, 100 laps, LU
all times:
[0.11898513112988955, 0.11368747217793944, 0.11556863916356903, 0.11312182352918797, 0.07259593807427585, 0.0634396208970783, 0.06384230931663318, 0.11150022257648459]
Note that the above Python script can save its arrays as Matlab .MAT data files. This is currently disabled (if 0, sorry), but if enabled, you can test Matlab's speed on the exact same data. Here's a timing script for Matlab:
clear
q = load('b_10000.mat');
A=q.A;
b=q.b;
clear q
matrix_time = timeit(#() A\b)
q = load('b_1.mat');
A=q.A;
b=q.b;
clear q
vector_time = timeit(#() A\b)
You'll need the timeit function from Mathworks File Exchange: http://www.mathworks.com/matlabcentral/fileexchange/18798-timeit-benchmarking-function. This produces the following output:
matrix_time =
0.0099989
vector_time =
2.2487e-05
The upshot of this empirical analysis is, in Python at least, don't use numpy's or scipy's solve when you have a triangular system, just use scipy.linalg.solve_triangular with at least the check_finite=False keyword argument for fast and non-destructive solutions.
Why not just use the equation: Q = inv(chol(S)) * X, here is my test:
import scipy.linalg
import numpy as np
N = 160
M = 100000
S = np.random.randn(N, N)
B = np.random.randn(N, M)
S = np.dot(S, S.T)
cS = scipy.linalg.cholesky(S, lower=True)
Y1 = scipy.linalg.solve(cS, B)
icS = scipy.linalg.inv(cS)
Y2 = np.dot(icS, B)
np.allclose(Y1, Y2)
output:
True
Here is the time test:
%time scipy.linalg.solve(cholS, B)
%time np.linalg.solve(cholS, B)
%time scipy.linalg.solve_triangular(cholS, B, lower=True)
%time ics=scipy.linalg.inv(cS);np.dot(ics, B)
output:
CPU times: user 2.07 s, sys: 0.00 s, total: 2.07 s
Wall time: 2.08 s
CPU times: user 1.93 s, sys: 0.00 s, total: 1.93 s
Wall time: 1.92 s
CPU times: user 1.12 s, sys: 0.00 s, total: 1.12 s
Wall time: 1.13 s
CPU times: user 0.71 s, sys: 0.00 s, total: 0.71 s
Wall time: 0.72 s
I don't know why scipy.linalg.solve_triangular is slower than numpy.linalg.solve on your system, but the inv version is the fastest.
A couple of things to try:
X = X.copy('F') # use fortran-order arrays, so that a copy is avoided
Y = solve_triangular(cholS, X, overwrite_b=True) # avoid another copy, but trash contents of X
Y = solve_triangular(cholS, X, check_finite=False) # Scipy >= 0.12 only --- but doesn't seem to have a large effect on speed...
With both of these, it should be pretty much equivalent to a direct call to MKL with no buffer copies.
I can't reproduce the issue with np.linalg.solve and scipy.linalg.solve having different speeds --- with the BLAS + LAPACK combination I have, both seem the same speed.
Related
I was hoping to find some clever approaches to solving a parallel-processing problem I've been struggling with. Basically, I am dealing with 20,160 multidimensional arrays with size (72,35,25,20). Currently, I'm integrating out the dimension with size 72 by simply doing a trapezoidal integration in a nested for-loop. My end goal is to get an output array with size (20160,35,25,20).
for idx,filename in enumerate(filenames):
#Read NetCDF Data File as 'raw_data'
flux=raw_data['FluxHydrogen'][:] #This is size (72,35,25,20)
PA=raw_data['PitchAngleGrid'][:] #This is size (72)
for i in range(35):
for j in range(25):
for k in range(20):
dir_flux=flux[:,i,j,k]
omni_flux=np.trapz(dir_flux*np.sin(PA),PA)
data[idx,i,j,k]=omni_flux #This will have size (20160,35,25,20)
I believe it would be most beneficial to implement the parallelization lower in the nested for-loop but can't seem to figure out how. I have searched for common questions, but none [that I have found] provide enough insight into how to implement shared memory, pass multidimensional arrays to the pools, and/or reshape the resulting array. Any help or insight would be greatly appreciated.
You can use Numba so to speed up this code by a large margin. Numba is a JIT compiler that is able to compile Numpy-based code to fast native codes (so loops are not a issue with it, in fact this is a good idea to use loops in Numba).
The first thing to do is to pre-compute np.sin(PA) once so to avoid repeated computations. Then, dir_flux * np.sin(PA) can be computed using a for loop and the result can be stored in a pre-allocated array so not to perform millions of expensive small array allocations. The outer loop can be executed using multiple threads using prange and the Numba flag parallel=True. It can be further accelerated using the flag fastmath=True assuming the input values are not special (like NaN or Inf or very very small: see subnormal numbers).
While this should theoretically enough got get a fast code, the current implementation of np.trapz is not efficient as it performs expensive allocations. One can easily rewrite the function so not to allocated any additional arrays.
Here are the resulting code:
import numpy as np
import numba as nb
#nb.njit('(float64[::1], float64[::1])')
def trapz(y, x):
s = 0.0
for i in range(x.size-1):
dx = x[i+1] - x[i]
dy = y[i] + y[i+1]
s += dx * dy
return s * 0.5
#nb.njit('(float64[:,:,:,:], float64[:])', parallel=True)
def compute(flux, PA):
sl, si, sj, sk = flux.shape
assert sl == PA.size
data = np.empty((si, sj, sk))
flattenPA = np.ascontiguousarray(PA)
sinPA = np.sin(flattenPA)
for i in nb.prange(si):
tmp = np.empty(sl)
for j in range(sj):
for k in range(sk):
dir_flux = flux[:, i, j, k]
for l in range(sl):
tmp[l] = dir_flux[l] * sinPA[l]
omni_flux = trapz(tmp, flattenPA)
data[i, j, k] = omni_flux
return data
for idx,filename in enumerate(filenames):
# Read NetCDF Data File as 'raw_data'
flux=raw_data['FluxHydrogen'][:] #This is size (72,35,25,20)
PA=raw_data['PitchAngleGrid'][:] #This is size (72)
data[idx] = compute(flux, PA)
Note flux and PA must be Numpy arrays. Also note that trapz is accurate as long as len(PA) is relatively small and np.std(PA) is not huge. Otherwise a pair-wise summation or even a (paranoid) Kahan summation should help (note Numpy use a pair-wise summation). In practice, results are the same on random normal numbers.
Further optimizations
The code can be made even faster by making flux accesses more contiguous. An efficient transposition can be used to do that (the one of Numpy is not efficient). However, this is not simple to do on 4D arrays. Another solution is to compute the trapz operation on whole lines of the k dimension. This makes the computation very efficient and nearly memory-bound on my machine. Here is the code:
#nb.njit('(float64[:,:,:,:], float64[:])', fastmath=True, parallel=True)
def compute(flux, PA):
sl, si, sj, sk = flux.shape
assert sl == PA.size
data = np.empty((si, sj, sk))
sinPA = np.sin(PA)
premultPA = PA * 0.5
for i in nb.prange(si):
for j in range(sj):
dir_flux = flux[:, i, j, :]
data[i, j, :].fill(0.0)
for l in range(sl-1):
dx = premultPA[l+1] - premultPA[l]
fact1 = dx * sinPA[l]
fact2 = dx * sinPA[l+1]
for k in range(sk):
data[i, j, k] += fact1 * dir_flux[l, k] + fact2 * dir_flux[l+1, k]
return data
Note the premultiplication make the computation slightly less precise.
Results
Here are results on random numbers (like #DominikStańczak used) on my 6-core machine (i5-9600KF processor):
Initial sequential solution: 193.14 ms (± 1.8 ms)
DominikStańczak sequential vectorized solution: 8.68 ms (± 48.1 µs)
Numba parallel solution without fastmath: 0.48 ms (± 6.7 µs)
Numba parallel solution without fastmath: 0.38 ms (± 9.5 µs)
Best Numba solution (with fastmath): 0.32 ms (± 5.2 µs)
Optimal lower-bound execution: 0.24 ms (RAM bandwidth saturation)
Thus, the fastest Numba version is 27 times faster than the (sequential) version of #DominikStańczak and 604 times faster than the initial one. It is nearly optimal.
As a first step, let's vectorize the code itself. I'm just going to deal with doing this on a per-file basis for now, to show you how to get rid of the nested for loop:
shape = (72, 35, 25, 20)
flux = np.random.normal(size=shape)
PA = np.random.normal(size=shape[0])
Now, timing your implementation, rewritten a little:
%%timeit
data = np.empty(shape[1:])
for i in range(shape[1]):
for j in range(shape[2]):
for k in range(shape[3]):
dir_flux=flux[:,i,j,k]
omni_flux=np.trapz(dir_flux*np.sin(PA),PA)
data[i,j,k]=omni_flux
# 211 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
My first idea was to pull sin out of the for loop, because there's no need to recalculate it each time, but that got me 10ms tops. However, if instead of the for loop, we used plain numpy vectorization via broadcasting, turning sin_PA into a (72, 1, 1, 1)-shaped array:
%%timeit
sin_PA = np.sin(PA).reshape(-1, 1, 1, 1)
data = np.trapz(flux * sin_PA, x=PA, axis=0)
# 9.03 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That's a 20 times speed up, nothing to scoff at. I estimate it'd take about three minutes for all your files. You can also use np.allclose to verify the results agree up to floating point error.
If you still need to parallelize this afterwards, I would use dask.array In fact, if you've got your data in netcdf4 files, I would use xarray (which is helpful for multidimensional data anyway) to read those, and then run the trapz computations on that with Dask enabled in the backend. I think that's the simplest way to achieve easy multiprocessing in this case. Here's a quick sketch:
import xarray
from Dask.distributed import Client
client = Client()
file_data = xarray.open_mfdataset(filenames, parallel=True)
# massage the data a little, probably
flux = file_data["FluxHydrogen"]
PA = file_data["PitchAngleGrid"]
integrand = flux * np.sin(PA) # most element-wise numpy operations work on xarray ones or Dask based ones without a hitch
data = integrand.integrate(coord="PitchAngle") # or some such name for the dimension you're integrating out
My python code takes about 6.2 seconds to run. The Matlab code runs in under 0.05 seconds. Why is this and what can I do to speed up the Python code? Is Cython the solution?
Matlab:
function X=Test
nIter=1000000;
Step=.001;
X0=1;
X=zeros(1,nIter+1); X(1)=X0;
tic
for i=1:nIter
X(i+1)=X(i)+Step*(X(i)^2*cos(i*Step+X(i)));
end
toc
figure(1) plot(0:nIter,X)
Python:
nIter = 1000000
Step = .001
x = np.zeros(1+nIter)
x[0] = 1
start = time.time()
for i in range(1,1+nIter):
x[i] = x[i-1] + Step*x[i-1]**2*np.cos(Step*(i-1)+x[i-1])
end = time.time()
print(end - start)
How to speed up your Python code
Your largest time sink is np.cos which performs several checks on the format of the input.
These are relevant and usually negligible for high-dimensional inputs, but for your one-dimensional input, this becomes the bottleneck.
The solution to this is to use math.cos, which only accepts one-dimensional numbers as input and thus is faster (though less flexible).
Another time sink is indexing x multiple times.
You can speed this up by having one state variable which you update and only writing to x once per iteration.
With all of this, you can speed up things by a factor of roughly ten:
import numpy as np
from math import cos
nIter = 1000000
Step = .001
x = np.zeros(1+nIter)
state = x[0] = 1
for i in range(nIter):
state += Step*state**2*cos(Step*i+state)
x[i+1] = state
Now, your main problem is that your truly innermost loop happens completely in Python, i.e., you have a lot of wrapping operations that eat up time.
You can avoid this by using uFuncs (e.g., created with SymPy’s ufuncify) and using NumPy’s accumulate:
import numpy as np
from sympy.utilities.autowrap import ufuncify
from sympy.abc import t,y
from sympy import cos
nIter = 1000000
Step = 0.001
state = x[0] = 1
f = ufuncify([y,t],y+Step*y**2*cos(t+y))
times = np.arange(0,nIter*Step,Step)
times[0] = 1
x = f.accumulate(times)
This runs practically within an instant.
… and why that’s not what you should worry about
If your exact code (and only that) is what you care about, then you shouldn’t worry about runtime anyway, because it’s very short either way.
If on the other hand, you use this to gauge efficiency for problems with a considerable runtime, your example will fail because it considers only one initial condition and is a very simple dynamics.
Moreover, you are using the Euler method, which is either not very efficient or robust, depending on your step size.
The latter (Step) is absurdly low in your case, yielding much more data than you probably need:
With a step size of 1, You can see what’s going on just fine.
If you want a robust integration in such cases, it’s almost always best to use a modern adaptive integrator, that can adjust its step size itself, e.g., here is a solution to your problem using a native Python integrator:
from math import cos
import numpy as np
from scipy.integrate import solve_ivp
T = 1000
dt = 0.001
x = solve_ivp(
lambda t,state: state**2*cos(t+state),
t_span = (0,T),
t_eval = np.arange(0,T,dt),
y0 = [1],
rtol = 1e-5
).y
This automatically adjusts the step size to something higher, depending on the error tolerance rtol.
It still returns the same amount of output data, but that’s via interpolation of the solution.
It runs in 0.3 s for me.
How to speed up things in a scalable manner
If you still need to speed up something like this, chances are that your derivative (f) is considerably more complex than in your example and thus it is the bottleneck.
Depending on your problem, you may be able to vectorise its calcultion (using NumPy or similar).
If you can’t vectorise, I wrote a module that specifically focusses on this by hard-coding your derivative under the hood.
Here is your example in with a sampling step of 1.
import numpy as np
from jitcode import jitcode,y,t
from symengine import cos
T = 1000
dt = 1
ODE = jitcode([y(0)**2*cos(t+y(0))])
ODE.set_initial_value([1])
ODE.set_integrator("dop853")
x = np.hstack([ODE.integrate(t) for t in np.arange(0,T,dt)])
This runs again within an instant. While this may not be a relevant speed boost here, this is scalable to huge systems.
The difference is jit-compilation, which Matlab uses per default. Let's try your example with Numba(a Python jit-compiler)
Code
import numba as nb
import numpy as np
import time
nIter = 1000000
Step = .001
#nb.njit()
def integrate(nIter,Step):
x = np.zeros(1+nIter)
x[0] = 1
for i in range(1,1+nIter):
x[i] = x[i-1] + Step*x[i-1]**2*np.cos(Step*(i-1)+x[i-1])
return x
#Avoid measuring the compilation time,
#this would be also recommendable for Matlab to have a fair comparison
res=integrate(nIter,Step)
start = time.time()
for i in range(100):
res=integrate(nIter,Step)
end=time.time()
print((end - start)/100)
This results in 0.022s runtime per call.
I am experiencing substantially slower matrix multiplication in R as compared to python. This is for large matrices. For example (in python):
import numpy as np
A = np.random.rand(4112, 23050).astype('float32')
B = np.random.rand(23050, 2500).astype('float32')
%timeit np.dot(A, B)
1 loops, best of 3: 1.09 s per loop
Here is the equivalent multiplication in R (takes almost 10x longer):
A <- matrix(rnorm(4112*23050), ncol = 23050)
B <- matrix(rnorm(23050*2500), ncol = 2500)
system.time(A %*% B)
user system elapsed
72.032 1.048 9.444
How can I achieve matrix multiplication speeds in R that are comparable to what is standard with python?
What I Have Already Tried:
1) Part of the descrepancy seems to be that python supports float32 whereas R only uses numeric, which is similar to (the same as?) float64. For example, the same python commands as above except with float64 takes twice as long (but still 5x slower than R):
import numpy as np
A = np.random.rand(4112, 23050).astype('float64')
B = np.random.rand(23050, 2500).astype('float64')
%timeit np.dot(A, B)
1 loops, best of 3: 2.24 s per loop
2) I am using the openBLAS linear algebra back-end for R.
3) RcppEigen as detailed in answer to this SO (see link for test.cpp file). The multiplication is about twice as fast in "user" time, but 3x slower in the more critical elapsed time as it only uses 1 of 8 threads.
library(Rcpp)
sourceCpp("test.cpp")
A <- matrix(rnorm(4112*23050), nrow = 4112)
B <- matrix(rnorm(23050*2500), ncol = 2500)
system.time(res <- eigenMatMult(A, B))
user system elapsed
29.436 0.056 29.551
I use MRO and python with anaconda and the MKL BLAS. Here are my results for the same data generating process, i.e. np.random.rand ('float64') or rnorm and identical dimensions (average and standard deviation over 10 replications ):
Python:
np.dot(A, B) # 1.3616 s (sd = 0.1776)
R:
Bt = t(B)
a = A %*% B # 2.0285 s (sd = 0.1897)
acp = tcrossprod(A, Bt) # 1.3098 s (sd = 0.1206)
identical(acp, a) # TRUE
Slightly tangential, but too long for a comment I think. To check whether the relevant compiler flags (e.g. -fopenmp) are set, use sourceCpp("testeigen.cpp",verbose=TRUE).
On my system, this showed that the OpenMP flags are not defined by default.
I did this to enable them (adapted from here):
library(Rcpp)
pkglibs <- "-fopenmp -lgomp"
pkgcxxflags <- "-fopenmp"
Sys.setenv(PKG_LIBS=pkglibs,PKG_CXXFLAGS=pkgcxxflags)
sourceCpp("testeigen.cpp",verbose=TRUE)
Dirk Eddelbuettel comments that he prefers to set the compiler flags in ~/.R/Makevars.
The example I took this from called the internal Rcpp:::RcppLdFlags and Rcpp:::RcppCxxFlags functions and prepended the results to the flags given above; this seems not to be necessary (?)
I have the following code in python:
def P(z, u0):
x = np.inner(z, u0)
tmp = x*u0
return (z - tmp)
def powerA2(A, u0):
x0 = np.random.rand(len(A))
for i in range(ITERATIONS):
x0 = P(np.dot(A, x0), u0)
x0 = x0 / np.linalg.norm(x0)
return (np.inner(np.dot(A, x0), x0))
np is numpy package.
I am interested in running this code for matrices in size of 100,000 * 100,000, but it seems that there is no chance for this program to run fast (I need to run it many times, about 10,000).
Is there any chance that tricks like multi-threading would work here?
Does anything else help to accelerate it?
You could consider using Pythran. Compiling the following code (norm.py):
#pythran export powerA2(float [][], float[])
import numpy as np
def P(z, u0):
x = np.inner(z, u0)
tmp = x*u0
return (z - tmp)
def norm(x):
return np.sqrt(np.sum(np.abs(x)**2))
def powerA2(A, u0):
ITERATIONS = 100
x0 = np.random.random(len(A))
for i in range(ITERATIONS):
x0 = P(np.dot(A, x0), u0)
x0 = x0 / norm(x0)
return (np.inner(np.dot(A, x0), x0))
with:
pythran norm.py
yields the following speedup:
$ python -m timeit -s 'import numpy as np; A = np.random.rand(100, 100); B = np.random.random(100); import norm' 'norm.powerA2(A, B)'
100 loops, best of 3: 3.1 msec per loop
$ pythran norm.py -O3 -march=native
$ python -m timeit -s 'import numpy as np; A = np.random.rand(100, 100); B = np.random.random(100); import norm' 'norm.powerA2(A, B)'
1000 loops, best of 3: 937 usec per loop
Just to check: you want to do 10^4 operations of something 10^10... so even if your operation is O(1), that's still 10^14 operations, which is a pretty hard problem (and as haraldkl pointed out in his comment, this is also eating a ton of memory) Just to check: are you going to call powerA2 10,000 times, or is 10,000 your desired value for ITERATIONS. If the former, you could use threads (or better yet, separate processes) to get some parallization but I don't know if that's going to be enough; if the latter, unless there's a trick I'm missing, your inputs don't seem as paralizable since the input for each loop iteration depend on the outputs of the previous. There may be a way to do this on GPU (I would like to think there'd be an efficient way to at least do the normalization bit such that it could do large numbers of stuff quickly by using vectorization)
Edit in response to comment: cpython (which is the most common python implementation) has a Global Interpeter Lock (GIL); some other python implementations (jython, ironpython) do not; per https://wiki.python.org/moin/GlobalInterpreterLock, .
Note that potentially blocking or long-running operations, such as
I/O, image processing, and NumPy number crunching, happen outside the
GIL. Therefore it is only in multithreaded programs that spend a lot
of time inside the GIL, interpreting CPython bytecode, that the GIL
becomes a bottleneck.
As far as I know, it should be possible to use threads with numpy and not be horribly bottlenecked but your problem still looks hard to convert to threads unless there's some bit of math I'm missing.
I get a 10% improvement over the uncompilled serge-sans-paille version by redefining functions this way:
def P0(z, u0):
x = np.inner(z, u0)
x *= u0
return (z - x)
def norm0(x):
return np.sqrt(np.sum(x*x))
def powerA20(A, u0):
ITERATIONS = 100
x0 = np.random.random(len(A))
for i in range(ITERATIONS):
x0 = P0(np.dot(A, x0), u0)
x0 /= norm0(x0)
return (np.inner(np.dot(A, x0), x0))
Doing things like *= u0 instead of x = x*u0 avoids unnecesary copies of the variables in RAM, speeding the program up a little bit.
Also, you don't need abs in that case. And finally, x*x is slightly faster than x**2.
I need to calculate the local statistics of a image depending on a 2D Window block defined by the user. Stats include : Mean, Variance, Skew, Kurtosis. I need to traverse through each pixel of the image and find the neighboring pixels depending on the window size.
The code that I used was:
scipy.ndimage.generic_filter(array,numpy.var,size=3)
But the performance through this is very low. I even tried strides-numpy but that too isn't showing much difference (wasn't able to compute skewness, kurtosis). I'm not familiar with Cython so have not ventured into that option.
So is there any other way to accomplish this without Cython?
The reason uniform_filter() is so much faster than generic_filter() is due to Python -- for generic_filter(), Python gets called for each pixel, while for uniform_filter(), the whole image is processed in native code. (I found OpenCV's boxFilter() even faster than uniform_filter(), see my answer to a "window variance" question.)
In the remainder of this answer, I show how to do a skew calculation using uniform_filter(), which dramatically speeds up a generic_filter()-based version such as:
import scipy.ndimage as ndimage, scipy.stats as st
ndimage.generic_filter(img, st.skew, size=(1,5))
SciPy's st.skew() (see, e.g., v0.17.0) appears to calculate the skew as
m3 / m2**1.5
where m3 = E[(X-m)**3] (the third central moment), m2 = E[(X-m)**2] (the variance), and m = E[X] (the mean).
To use uniform_filter(), one has to write this in terms of raw moments such as m3p = E[X**3] and m2p = E[X**2] (a prime symbol is usually used to distinguish the raw moment from the central one):
m3 = E[(X-m)**3] = ... = m3p - 3*m*m2p + 2*m**3
m2 = E[(X-m)**2] = ... = m2p - m*m
(In case my "..." skips too much, this answer has the full derivation for m2.) Then one can implement skew() using uniform_filter() (or boxFilter() for some additional speedup):
def winSkew(img, wsize):
imgS = img*img
m, m2p, m3p = (ndimage.uniform_filter(x, wsize) for x in (img, imgS, imgS*img))
mS = m*m
return (m3p-3*m*m2p+2*mS*m)/(m2p-mS)**1.5
Compared to generic_filter(), winSkew() gives a 654-fold speedup on the following example on my machine:
In [185]: img = np.random.randint(0, 256, (500,500)).astype(np.float)
In [186]: %timeit ndimage.generic_filter(img, st.skew, size=(1,5))
1 loops, best of 3: 14.2 s per loop
In [188]: %timeit winSkew(img, (1,5))
10 loops, best of 3: 21.7 ms per loop
And the two calculations give essentially identical results:
In [190]: np.allclose(winSkew(img, (1,5)), ndimage.generic_filter(img, st.skew, size=(1,5)))
Out[190]: True
The code for a Kurtosis calculation can be derived the same way.
The problem is that generic_filter() cannot assume that your filter is separable along the x or y axes. Thus it must operate as a true 2D filter rather than a series of two 1D filters, so run-time will be much slower.
The mean filter and is equivalent (I think) to the uniform_filter(), which if you read the documentation, is implemented as a series of two 1d uniform filters.
I compared timing via this code block:
import numpy as np
from scipy import ndimage as ndi
from scipy import misc
baboonfile = '/Users/curt/Downloads/BaboonRGB.jpg' #local download of http://read.pudn.com/downloads169/sourcecode/graph/texture_mapping/776733/Picture/BaboonRGB__.jpg
im = misc.imread(baboonfile)
meanfilt2D = ndi.generic_filter(im, np.mean, size=[3, 3, 1])
%timeit meanfilt2D = ndi.generic_filter(im, np.mean, size=[3, 3, 1])
print meanfilt2D.shape
meanfiltU = ndi.uniform_filter(im, size=[3, 3, 1])
%timeit meanfiltU = ndi.uniform_filter(im, size=[3, 3, 1])
print meanfiltU.shape
The output of that block was:
1 loops, best of 3: 5.22 s per loop
(512, 512, 3)
100 loops, best of 3: 11.8 ms per loop
(512, 512, 3)
so true two-dimensional generic_filter() takes 5 seconds for a small image but the two-pass 1D uniform_filter() takes only milliseconds. (N.B.: The difference image meanfilt2D-meanfiltU was not identically zero, but the maximum element was 2; I think the differences are caused by rounding and the imprecise datatype (uint8) used for im.)
For variance and other filters, you should see this old Stack Overflow post which answers a highly related question.