I was hoping to find some clever approaches to solving a parallel-processing problem I've been struggling with. Basically, I am dealing with 20,160 multidimensional arrays with size (72,35,25,20). Currently, I'm integrating out the dimension with size 72 by simply doing a trapezoidal integration in a nested for-loop. My end goal is to get an output array with size (20160,35,25,20).
for idx,filename in enumerate(filenames):
    #Read NetCDF Data File as 'raw_data'
    flux=raw_data['FluxHydrogen'][:] #This is size (72,35,25,20)
    PA=raw_data['PitchAngleGrid'][:] #This is size (72)
    for i in range(35):
        for j in range(25):
            for k in range(20):
                dir_flux=flux[:,i,j,k]
                omni_flux=np.trapz(dir_flux*np.sin(PA),PA)
                data[idx,i,j,k]=omni_flux #This will have size (20160,35,25,20)
I believe it would be most beneficial to implement the parallelization lower in the nested for-loop but can't seem to figure out how. I have searched for common questions, but none [that I have found] provide enough insight into how to implement shared memory, pass multidimensional arrays to the pools, and/or reshape the resulting array. Any help or insight would be greatly appreciated.
You can use Numba to speed up this code by a large margin. Numba is a JIT compiler that compiles NumPy-based code to fast native code (so loops are not an issue with it; in fact, using loops is a good idea in Numba).
The first thing to do is to pre-compute np.sin(PA) once to avoid repeated computations. Then, dir_flux * np.sin(PA) can be computed using a for loop, and the result can be stored in a pre-allocated array so as not to perform millions of expensive small array allocations. The outer loop can be executed using multiple threads using prange and the Numba flag parallel=True. It can be further accelerated using the flag fastmath=True, assuming the input values are not special (like NaN or Inf or very very small: see subnormal numbers).
While this should theoretically be enough to get fast code, the current implementation of np.trapz is not efficient as it performs expensive allocations. One can easily rewrite the function so that it does not allocate any additional arrays.
Here is the resulting code:
import numpy as np
import numba as nb

@nb.njit('(float64[::1], float64[::1])')
def trapz(y, x):
    # Trapezoidal rule without any temporary array
    s = 0.0
    for i in range(x.size-1):
        dx = x[i+1] - x[i]
        dy = y[i] + y[i+1]
        s += dx * dy
    return s * 0.5

@nb.njit('(float64[:,:,:,:], float64[:])', parallel=True)
def compute(flux, PA):
    sl, si, sj, sk = flux.shape
    assert sl == PA.size
    data = np.empty((si, sj, sk))
    flattenPA = np.ascontiguousarray(PA)
    sinPA = np.sin(flattenPA)  # computed once, reused everywhere
    for i in nb.prange(si):
        tmp = np.empty(sl)  # one scratch buffer per outer iteration
        for j in range(sj):
            for k in range(sk):
                dir_flux = flux[:, i, j, k]
                for l in range(sl):
                    tmp[l] = dir_flux[l] * sinPA[l]
                omni_flux = trapz(tmp, flattenPA)
                data[i, j, k] = omni_flux
    return data

for idx,filename in enumerate(filenames):
    # Read NetCDF Data File as 'raw_data'
    flux=raw_data['FluxHydrogen'][:] #This is size (72,35,25,20)
    PA=raw_data['PitchAngleGrid'][:] #This is size (72)
    data[idx] = compute(flux, PA)
Note that flux and PA must be NumPy arrays. Also note that trapz is accurate as long as len(PA) is relatively small and np.std(PA) is not huge. Otherwise, a pairwise summation or even a (paranoid) Kahan summation should help (note that NumPy uses a pairwise summation). In practice, the results are the same on random normal numbers.
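For illustration, here is a minimal pure-Python sketch of the Kahan-compensated variant mentioned above. The name trapz_kahan and the random test grid are hypothetical and not part of the answer; the sketch only shows how the compensation term is carried through the loop, and it is checked against NumPy's own trapezoidal routine.

```python
import numpy as np

def trapz_kahan(y, x):
    """Trapezoidal rule with Kahan (compensated) summation."""
    s = 0.0  # running sum
    c = 0.0  # compensation for low-order bits lost so far
    for i in range(len(x) - 1):
        term = (x[i + 1] - x[i]) * (y[i] + y[i + 1]) * 0.5
        t = term - c           # apply the correction from previous steps
        new_s = s + t
        c = (new_s - s) - t    # what was lost when adding t to s
        s = new_s
    return s

rng = np.random.default_rng(0)
PA = np.sort(rng.uniform(0.0, np.pi, 72))  # hypothetical pitch-angle grid
y = rng.normal(size=72) * np.sin(PA)

# np.trapz was renamed np.trapezoid in NumPy 2.0; use whichever exists.
np_trapz = getattr(np, "trapezoid", None) or np.trapz
reference = np_trapz(y, PA)
result = trapz_kahan(y, PA)
```

With only 72 samples the compensated and plain sums should agree to rounding error, which is consistent with the remark that the extra care is rarely needed here.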
Further optimizations
The code can be made even faster by making flux accesses more contiguous. An efficient transposition can be used to do that (the one of Numpy is not efficient). However, this is not simple to do on 4D arrays. Another solution is to compute the trapz operation on whole lines of the k dimension. This makes the computation very efficient and nearly memory-bound on my machine. Here is the code:
@nb.njit('(float64[:,:,:,:], float64[:])', fastmath=True, parallel=True)
def compute(flux, PA):
    sl, si, sj, sk = flux.shape
    assert sl == PA.size
    data = np.empty((si, sj, sk))
    sinPA = np.sin(PA)
    premultPA = PA * 0.5  # folds the 0.5 factor of the trapezoidal rule into dx
    for i in nb.prange(si):
        for j in range(sj):
            dir_flux = flux[:, i, j, :]
            data[i, j, :].fill(0.0)
            for l in range(sl-1):
                dx = premultPA[l+1] - premultPA[l]
                fact1 = dx * sinPA[l]
                fact2 = dx * sinPA[l+1]
                # Accumulate over a whole contiguous line of the k dimension
                for k in range(sk):
                    data[i, j, k] += fact1 * dir_flux[l, k] + fact2 * dir_flux[l+1, k]
    return data
Note that the premultiplication makes the computation slightly less precise.
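One way to read the loop above is that the trapezoidal rule is just a weighted sum over the pitch-angle axis, with sin(PA) folded into the weights. The following NumPy-only sketch (no Numba, so it will not reach the timings reported here) makes that explicit. The helper name trapz_weights and the reduced array shape are assumptions made for illustration.

```python
import numpy as np

def trapz_weights(x):
    """Weights w such that sum(w * y) is the trapezoidal integral of y over x."""
    dx = np.diff(x)
    w = np.zeros_like(x, dtype=float)
    w[:-1] += 0.5 * dx  # each interval gives half its width to its left node
    w[1:] += 0.5 * dx   # and the other half to its right node
    return w

rng = np.random.default_rng(1)
flux = rng.normal(size=(72, 5, 4, 3))      # small stand-in for (72, 35, 25, 20)
PA = np.sort(rng.uniform(0.0, np.pi, 72))

w = trapz_weights(PA) * np.sin(PA)         # fold sin(PA) into the weights once
omni = np.tensordot(w, flux, axes=(0, 0))  # contract the size-72 axis

# np.trapz was renamed np.trapezoid in NumPy 2.0; use whichever exists.
np_trapz = getattr(np, "trapezoid", None) or np.trapz
reference = np_trapz(flux * np.sin(PA).reshape(-1, 1, 1, 1), x=PA, axis=0)
```

Since the weights depend only on PA, they could be computed once and reused for every file, provided all files share the same pitch-angle grid (which the question does not state).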
Results
Here are results on random numbers (like @DominikStańczak used) on my 6-core machine (i5-9600KF processor):
Initial sequential solution: 193.14 ms (± 1.8 ms)
DominikStańczak sequential vectorized solution: 8.68 ms (± 48.1 µs)
Numba parallel solution without fastmath: 0.48 ms (± 6.7 µs)
Numba parallel solution without fastmath: 0.38 ms (± 9.5 µs)
Best Numba solution (with fastmath): 0.32 ms (± 5.2 µs)
Optimal lower-bound execution: 0.24 ms (RAM bandwidth saturation)
Thus, the fastest Numba version is 27 times faster than the (sequential) version of @DominikStańczak and 604 times faster than the initial one. It is nearly optimal.
As a first step, let's vectorize the code itself. I'm just going to deal with doing this on a per-file basis for now, to show you how to get rid of the nested for loop:
shape = (72, 35, 25, 20)
flux = np.random.normal(size=shape)
PA = np.random.normal(size=shape[0])
Now, timing your implementation, rewritten a little:
%%timeit
data = np.empty(shape[1:])
for i in range(shape[1]):
    for j in range(shape[2]):
        for k in range(shape[3]):
            dir_flux=flux[:,i,j,k]
            omni_flux=np.trapz(dir_flux*np.sin(PA),PA)
            data[i,j,k]=omni_flux
# 211 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
My first idea was to pull sin out of the for loop, because there's no need to recalculate it each time, but that got me 10ms tops. However, if instead of the for loop, we used plain numpy vectorization via broadcasting, turning sin_PA into a (72, 1, 1, 1)-shaped array:
%%timeit
sin_PA = np.sin(PA).reshape(-1, 1, 1, 1)
data = np.trapz(flux * sin_PA, x=PA, axis=0)
# 9.03 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That's a 20 times speed up, nothing to scoff at. I estimate it'd take about three minutes for all your files. You can also use np.allclose to verify the results agree up to floating point error.
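As a sketch of the np.allclose check mentioned above, the snippet below runs both the triple loop and the broadcast version on the same random data and compares them. The shape is reduced (an assumption made only to keep the loop quick), and the variable names looped and vectorized are hypothetical.

```python
import numpy as np

# np.trapz was renamed np.trapezoid in NumPy 2.0; use whichever exists.
np_trapz = getattr(np, "trapezoid", None) or np.trapz

shape = (72, 6, 5, 4)  # reduced from (72, 35, 25, 20)
rng = np.random.default_rng(2)
flux = rng.normal(size=shape)
PA = rng.normal(size=shape[0])

# Reference: the original triple loop
looped = np.empty(shape[1:])
for i in range(shape[1]):
    for j in range(shape[2]):
        for k in range(shape[3]):
            looped[i, j, k] = np_trapz(flux[:, i, j, k] * np.sin(PA), PA)

# Vectorized: broadcast sin(PA) against the leading axis
sin_PA = np.sin(PA).reshape(-1, 1, 1, 1)
vectorized = np_trapz(flux * sin_PA, x=PA, axis=0)

agree = np.allclose(looped, vectorized)
```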
If you still need to parallelize this afterwards, I would use dask.array. In fact, if you've got your data in NetCDF4 files, I would use xarray (which is helpful for multidimensional data anyway) to read those, and then run the trapz computations on that with Dask enabled in the backend. I think that's the simplest way to achieve easy multiprocessing in this case. Here's a quick sketch:
import numpy as np
import xarray
from dask.distributed import Client
client = Client()
file_data = xarray.open_mfdataset(filenames, parallel=True)
# massage the data a little, probably
flux = file_data["FluxHydrogen"]
PA = file_data["PitchAngleGrid"]
integrand = flux * np.sin(PA) # most element-wise numpy operations work on xarray ones or Dask based ones without a hitch
data = integrand.integrate(coord="PitchAngle") # or some such name for the dimension you're integrating out
I'm currently trying to do a Monte Carlo simulation. The problem is that it's taking quite a while to run 100,000 runs or more, when I'm told it shouldn't take very long.
Here's my code:
runs = 10000
import matplotlib.pyplot as plt
import random
import numpy as np
from scipy.stats import norm
from scipy.stats import uniform
import seaborn as sns
import pandas
def steadystate():
    p=0.88
    Cout=4700000000
    LambdaAER=0.72
    Vol=44.5
    Depo=0.42
    Uptime=0.1
    Effic=0.38
    Recirc=4.3
    x = random.randint(86900000,2230000000000)
    conc = ((p*Cout*LambdaAER)+(x/Vol))/(LambdaAER+Depo+(Uptime*Effic*Recirc))
    return conc
x = 0
while x < runs:
    #results = steadystate (Faster)
    results = np.array([steadystate() for _ in range(1000)])
    print(results)
    x+=1
ax = sns.distplot(results,
bins=100,
kde=True,
color='skyblue',
hist_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Uniform Distribution ', ylabel='Frequency')
I'm fairly new at Python, so I'm unsure of where to optimize my code. Any help or suggestions would be much appreciated.
You're not actually benefiting from numpy here, because you produce each value one at a time, doing all the math for that one value, then producing the array from the results. Work with arrays from the get-go, and do all the work on all elements in bulk to derive the benefits of vectorization:
import numpy.random
def steadystate(count): # Receive desired number of values for bulk generation
    p=0.88
    Cout=4700000000
    LambdaAER=0.72
    Vol=44.5
    Depo=0.42
    Uptime=0.1
    Effic=0.38
    Recirc=4.3
    x = numpy.random.randint(86900000, 2230000000000, count) # Make array of count values all at once
    # Perform all the math in bulk
    conc = ((p*Cout*LambdaAER)+(x/Vol))/(LambdaAER+Depo+(Uptime*Effic*Recirc))
    return conc
x = 0
while x < runs:
    results = steadystate(1000) # Just call with number of desired items
    print(results)
    x+=1
Note that this code matches your original code by replacing results each time, rather than accumulating results. I'm not clear on what you want to do instead, so this is just doing the (probably) wrong thing much faster.
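If the intent is to keep every run rather than overwrite results (an assumption, since the question does not say), one possible sketch is to draw all values in a single call and get back a 2D array with one row per run. The function name steadystate_all and its parameters are hypothetical; the constants are the ones from the question.

```python
import numpy as np

def steadystate_all(runs, per_run, seed=None):
    """Return a (runs, per_run) array of concentrations drawn in one call."""
    p = 0.88
    Cout = 4700000000
    LambdaAER = 0.72
    Vol = 44.5
    Depo = 0.42
    Uptime = 0.1
    Effic = 0.38
    Recirc = 4.3
    rng = np.random.default_rng(seed)
    # endpoint=True keeps the upper bound inclusive, like random.randint
    x = rng.integers(86900000, 2230000000000, size=(runs, per_run), endpoint=True)
    return ((p * Cout * LambdaAER) + (x / Vol)) / (LambdaAER + Depo + (Uptime * Effic * Recirc))

results = steadystate_all(runs=100, per_run=1000, seed=0)
```

The whole array can then be flattened with results.ravel() for a single histogram, or reduced per run with results.mean(axis=1).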
About 70% of the time you are losing goes into creating the random numbers.
The question is whether you need new random numbers each time. Would it perhaps be sufficient to generate the random matrix just once and reuse it?
However, the code is already pretty quick, isn't it? Apart from the plotting part, this part took just 1.2 ms for one iteration:
%timeit results = np.array([steadystate() for _ in range(1000)])
1.24 ms ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I am experiencing substantially slower matrix multiplication in R as compared to python. This is for large matrices. For example (in python):
import numpy as np
A = np.random.rand(4112, 23050).astype('float32')
B = np.random.rand(23050, 2500).astype('float32')
%timeit np.dot(A, B)
1 loops, best of 3: 1.09 s per loop
Here is the equivalent multiplication in R (takes almost 10x longer):
A <- matrix(rnorm(4112*23050), ncol = 23050)
B <- matrix(rnorm(23050*2500), ncol = 2500)
system.time(A %*% B)
user system elapsed
72.032 1.048 9.444
How can I achieve matrix multiplication speeds in R that are comparable to what is standard with python?
What I Have Already Tried:
1) Part of the discrepancy seems to be that python supports float32 whereas R only uses numeric, which is similar to (the same as?) float64. For example, the same python commands as above except with float64 take twice as long (but are still about 4x faster than R, 2.24 s versus 9.44 s elapsed):
import numpy as np
A = np.random.rand(4112, 23050).astype('float64')
B = np.random.rand(23050, 2500).astype('float64')
%timeit np.dot(A, B)
1 loops, best of 3: 2.24 s per loop
2) I am using the openBLAS linear algebra back-end for R.
3) RcppEigen as detailed in answer to this SO (see link for test.cpp file). The multiplication is about twice as fast in "user" time, but 3x slower in the more critical elapsed time as it only uses 1 of 8 threads.
library(Rcpp)
sourceCpp("test.cpp")
A <- matrix(rnorm(4112*23050), nrow = 4112)
B <- matrix(rnorm(23050*2500), ncol = 2500)
system.time(res <- eigenMatMult(A, B))
user system elapsed
29.436 0.056 29.551
I use MRO and python with anaconda and the MKL BLAS. Here are my results for the same data generating process, i.e. np.random.rand ('float64') or rnorm and identical dimensions (average and standard deviation over 10 replications ):
Python:
np.dot(A, B) # 1.3616 s (sd = 0.1776)
R:
Bt = t(B)
a = A %*% B # 2.0285 s (sd = 0.1897)
acp = tcrossprod(A, Bt) # 1.3098 s (sd = 0.1206)
identical(acp, a) # TRUE
Slightly tangential, but too long for a comment I think. To check whether the relevant compiler flags (e.g. -fopenmp) are set, use sourceCpp("testeigen.cpp",verbose=TRUE).
On my system, this showed that the OpenMP flags are not defined by default.
I did this to enable them (adapted from here):
library(Rcpp)
pkglibs <- "-fopenmp -lgomp"
pkgcxxflags <- "-fopenmp"
Sys.setenv(PKG_LIBS=pkglibs,PKG_CXXFLAGS=pkgcxxflags)
sourceCpp("testeigen.cpp",verbose=TRUE)
Dirk Eddelbuettel comments that he prefers to set the compiler flags in ~/.R/Makevars.
The example I took this from called the internal Rcpp:::RcppLdFlags and Rcpp:::RcppCxxFlags functions and prepended the results to the flags given above; this seems not to be necessary (?)
Consider the following:
fine = np.random.uniform(0,100,10)
fine[fine<20] = 0 # introduce some intermittency
coarse = np.sum(fine.reshape(-1,2),axis=1)
fine is a timeseries of magnitudes (e.g. volume of rainfall). coarse is the same timeseries but at a halved resolution, so every 2 timesteps in fine are aggregated to a single value in coarse.
I am then interested in the weighting that determines the proportions of the magnitude of coarse that corresponds to each timestep in fine for the instances where the value of coarse is above zero.
def w_xx(fine, coarse):
    weights = []
    for i, val in enumerate(coarse):
        if val > 0:
            w = fine[i*2:i*2+2]/val # returns both w1 and w2, w1 is 1st element, w2 = 1-w1 is second
            weights.append(w)
    return np.asarray(weights)
So w_xx(fine,coarse) would return an array of shape 5,2 where the elements of axis=1 are the weights of fine for a value of coarse.
This is all fine for smaller timeseries, but I'm running this analysis on ~60k-sized arrays of fine, plus in a loop of 300+ iterations.
I have been trying to make this run in parallel using the multiprocessing library in Python 2.7, but I've not managed to get far. I need to be reading both timeseries at the same time in order to get the corresponding values of fine for every value in coarse, and to only work on values above 0, which is what my analysis requires.
I would appreciate suggestions on a better way to do this. I imagine if I can define a mapping function to use with Pool.map in multiprocessing, I should be able to parallelize this? I've only just started out with multiprocessing so I don't know if there is another way?
Thank you.
You can achieve the same result in a vectorized form by simply doing:
>>> (fine / np.repeat(coarse, 2)).reshape(-1, 2)
Then you can filter out the rows for which coarse is zero by using np.isfinite, since if coarse is zero the output is either inf or nan.
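As a variation on the same idea, the rows can be selected before dividing, which avoids the divide-by-zero warnings altogether. This is only a sketch: the function name w_masked is hypothetical, and it assumes fine has an even length so that it reshapes cleanly into pairs.

```python
import numpy as np

def w_masked(fine, coarse):
    """Weights of each fine step within its coarse step, for coarse > 0 only."""
    pairs = fine.reshape(-1, 2)           # row i holds the two steps behind coarse[i]
    keep = coarse > 0                     # boolean row mask
    return pairs[keep] / coarse[keep, None]

rng = np.random.default_rng(3)
fine = rng.uniform(0, 100, 1000)
fine[fine < 20] = 0                       # introduce some intermittency
coarse = fine.reshape(-1, 2).sum(axis=1)

weights = w_masked(fine, coarse)
```

Each returned row sums to 1, matching the w1 and w2 = 1 - w1 convention of the loop version.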
In addition to the NumPy expression proposed by @behzad.nouri, you can use the Pythran compiler to reap extra speedups:
$ cat w_xx.py
#pythran export w_xx(float[], float[])
import numpy as np
def w_xx(fine, coarse):
    w = (fine / np.repeat(coarse, 2))
    return w[np.isfinite(w)].reshape(-1, 2)
$ python -m timeit -s 'import numpy as np; fine = np.random.uniform(0, 100, 100000); fine[fine<20] = 0; coarse = np.sum(fine.reshape(-1, 2), axis=1); from w_xx import w_xx' 'w_xx(fine, coarse)'
1000 loops, best of 3: 1.5 msec per loop
$ pythran w_xx.py -fopenmp -march=native # yes, this generates parallel code
$ python -m timeit -s 'import numpy as np; fine = np.random.uniform(0, 100, 100000); fine[fine<20] = 0; coarse = np.sum(fine.reshape(-1, 2), axis=1); from w_xx import w_xx' 'w_xx(fine, coarse)'
1000 loops, best of 3: 867 usec per loop
Disclaimer: I am a Pythran dev.
Excellent! I didn't know about np.repeat, thank you very much.
To answer my original question in the form it was presented, I've then also managed to make this work with multiprocessing:
import numpy as np
from multiprocessing import Pool
fine = np.random.uniform(0,100,100000)
fine[fine<20] = 0
coarse = np.sum(fine.reshape(-1,2),axis=1)
def wfunc(zipped):
    return zipped[0]/zipped[1]
def wpar(zipped, processes):
    p = Pool(processes)
    # Map over the (fine, coarse) pairs passed in by the caller
    calc = np.asarray(p.map(wfunc, zipped))
    p.close()
    p.join()
    return calc[np.isfinite(calc)].reshape(-1,2)
However, the suggestion by @behzad.nouri is evidently better:
def w_opt(fine, coarse):
    w = (fine / np.repeat(coarse, 2))
    return w[np.isfinite(w)].reshape(-1,2)
#using some iPython magic
%timeit w_opt(fine,coarse)
1000 loops, best of 3: 1.88 ms per loop
%timeit w_xx(fine,coarse)
1 loops, best of 3: 342 ms per loop
%timeit wpar(zip(fine,np.repeat(coarse,2)),6) #I've 6 cores at my disposal
1 loops, best of 3: 1.76 s per loop
Thanks again!
I need to calculate the local statistics of an image over a 2D window block defined by the user. The stats include: mean, variance, skew, kurtosis. I need to traverse each pixel of the image and find the neighboring pixels depending on the window size.
The code that I used was:
scipy.ndimage.generic_filter(array,numpy.var,size=3)
But the performance through this is very low. I even tried strides-numpy but that too isn't showing much difference (wasn't able to compute skewness, kurtosis). I'm not familiar with Cython so have not ventured into that option.
So is there any other way to accomplish this without Cython?
The reason uniform_filter() is so much faster than generic_filter() is due to Python -- for generic_filter(), Python gets called for each pixel, while for uniform_filter(), the whole image is processed in native code. (I found OpenCV's boxFilter() even faster than uniform_filter(), see my answer to a "window variance" question.)
In the remainder of this answer, I show how to do a skew calculation using uniform_filter(), which dramatically speeds up a generic_filter()-based version such as:
import scipy.ndimage as ndimage, scipy.stats as st
ndimage.generic_filter(img, st.skew, size=(1,5))
SciPy's st.skew() (see, e.g., v0.17.0) appears to calculate the skew as
m3 / m2**1.5
where m3 = E[(X-m)**3] (the third central moment), m2 = E[(X-m)**2] (the variance), and m = E[X] (the mean).
To use uniform_filter(), one has to write this in terms of raw moments such as m3p = E[X**3] and m2p = E[X**2] (a prime symbol is usually used to distinguish the raw moment from the central one):
m3 = E[(X-m)**3] = ... = m3p - 3*m*m2p + 2*m**3
m2 = E[(X-m)**2] = ... = m2p - m*m
(In case my "..." skips too much, this answer has the full derivation for m2.) Then one can implement skew() using uniform_filter() (or boxFilter() for some additional speedup):
def winSkew(img, wsize):
    imgS = img*img
    m, m2p, m3p = (ndimage.uniform_filter(x, wsize) for x in (img, imgS, imgS*img))
    mS = m*m
    return (m3p-3*m*m2p+2*mS*m)/(m2p-mS)**1.5
Compared to generic_filter(), winSkew() gives a 654-fold speedup on the following example on my machine:
In [185]: img = np.random.randint(0, 256, (500,500)).astype(float)
In [186]: %timeit ndimage.generic_filter(img, st.skew, size=(1,5))
1 loops, best of 3: 14.2 s per loop
In [188]: %timeit winSkew(img, (1,5))
10 loops, best of 3: 21.7 ms per loop
And the two calculations give essentially identical results:
In [190]: np.allclose(winSkew(img, (1,5)), ndimage.generic_filter(img, st.skew, size=(1,5)))
Out[190]: True
The code for a Kurtosis calculation can be derived the same way.
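A possible sketch of that kurtosis calculation follows, using the same raw-moment approach. It assumes the excess (Fisher) definition m4 / m2**2 - 3, which is the default of scipy.stats.kurtosis, together with the expansion m4 = m4p - 4*m*m3p + 6*m**2*m2p - 3*m**4. The name winKurt and the small random test image are hypothetical.

```python
import numpy as np
import scipy.ndimage as ndimage
import scipy.stats as st

def winKurt(img, wsize):
    """Windowed excess kurtosis, m4 / m2**2 - 3, built from raw moments."""
    img2 = img * img
    m, m2p, m3p, m4p = (ndimage.uniform_filter(x, wsize)
                        for x in (img, img2, img2 * img, img2 * img2))
    mS = m * m
    m2 = m2p - mS                                         # variance
    m4 = m4p - 4 * m * m3p + 6 * mS * m2p - 3 * mS * mS   # fourth central moment
    return m4 / (m2 * m2) - 3.0

rng = np.random.default_rng(4)
img = rng.random((40, 50))
kurt = winKurt(img, (1, 5))

# Slow reference through generic_filter, for comparison only
reference = ndimage.generic_filter(img, st.kurtosis, size=(1, 5))
```

As with winSkew(), windows whose values are all identical have zero variance and produce a division by zero, so such pixels would need separate handling.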
The problem is that generic_filter() cannot assume that your filter is separable along the x or y axes. Thus it must operate as a true 2D filter rather than a series of two 1D filters, so run-time will be much slower.
The mean filter is equivalent (I think) to uniform_filter(), which, according to the documentation, is implemented as a series of 1D uniform filters.
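The separability claim can be checked on a small array. The sketch below compares a single uniform_filter() call with two explicit uniform_filter1d() passes and with the generic 3x3 mean; the random image and the variable names are assumptions for illustration, and all three calls use the default boundary mode.

```python
import numpy as np
from scipy import ndimage as ndi

rng = np.random.default_rng(5)
im = rng.random((64, 48))

# One call with a 3x3 window
full = ndi.uniform_filter(im, size=3)

# Two 1D passes, one per axis
rows = ndi.uniform_filter1d(im, size=3, axis=0)
two_pass = ndi.uniform_filter1d(rows, size=3, axis=1)

# The same 3x3 mean through the generic (non-separable) path
generic = ndi.generic_filter(im, np.mean, size=3)
```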
I compared timing via this code block:
import numpy as np
from scipy import ndimage as ndi
from scipy import misc
baboonfile = '/Users/curt/Downloads/BaboonRGB.jpg' #local download of http://read.pudn.com/downloads169/sourcecode/graph/texture_mapping/776733/Picture/BaboonRGB__.jpg
im = misc.imread(baboonfile)
meanfilt2D = ndi.generic_filter(im, np.mean, size=[3, 3, 1])
%timeit meanfilt2D = ndi.generic_filter(im, np.mean, size=[3, 3, 1])
print meanfilt2D.shape
meanfiltU = ndi.uniform_filter(im, size=[3, 3, 1])
%timeit meanfiltU = ndi.uniform_filter(im, size=[3, 3, 1])
print meanfiltU.shape
The output of that block was:
1 loops, best of 3: 5.22 s per loop
(512, 512, 3)
100 loops, best of 3: 11.8 ms per loop
(512, 512, 3)
so true two-dimensional generic_filter() takes 5 seconds for a small image but the two-pass 1D uniform_filter() takes only milliseconds. (N.B.: The difference image meanfilt2D-meanfiltU was not identically zero, but the maximum element was 2; I think the differences are caused by rounding and the imprecise datatype (uint8) used for im.)
For variance and other filters, you should see this old Stack Overflow post which answers a highly related question.
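For the variance case specifically, a minimal sketch using only uniform_filter() is shown here. It relies on the identity Var[X] = E[X**2] - E[X]**2 over each window; the name winVar and the random float image are hypothetical.

```python
import numpy as np
import scipy.ndimage as ndimage

def winVar(img, wsize):
    """Windowed (population) variance: E[X**2] - E[X]**2 over each window."""
    m = ndimage.uniform_filter(img, wsize)
    m2p = ndimage.uniform_filter(img * img, wsize)
    return m2p - m * m

rng = np.random.default_rng(6)
img = rng.random((40, 50))
var = winVar(img, 3)

# Slow reference through generic_filter, for comparison only
reference = ndimage.generic_filter(img, np.var, size=3)
```

For integer images such as uint8, converting to float first (for example img.astype(float)) avoids overflow in img * img and the rounding differences noted above.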
I have a square matrix S (160 x 160), and a huge matrix X (160 x 250000). Both are dense numpy arrays.
My goal: find Q such that Q = inv(chol(S)) * X, where chol(S) is the lower cholesky factorization of S.
Naturally, a simple solution is
cholS = scipy.linalg.cholesky( S, lower=True)
scipy.linalg.solve( cholS, X )
My problem: this solution is noticeably slower (>2x) in python than when I try the same in Matlab. Here are some timing experiments:
timeit np.linalg.solve( cholS, X)
1 loops, best of 3: 1.63 s per loop
timeit scipy.linalg.solve_triangular( cholS, X, lower=True)
1 loops, best of 3: 2.19 s per loop
timeit scipy.linalg.solve( cholS, X)
1 loops, best of 3: 2.81 s per loop
[matlab]
cholS \ X
0.675 s
[matlab using only one thread via -singleCompThread]
cholS \ X
1.26 s
Basically, I'd like to know: (1) can I reach Matlab-like speeds in python? and (2) why is the scipy version so slow?
The solver should be able to take advantage of the fact that chol(S) is triangular. However, using numpy.linalg.solve() is faster than scipy.linalg.solve_triangular(), even though the numpy call doesn't use the triangular structure at all. What gives? The matlab solver seems to auto-detect when my matrix is triangular, but python cannot.
I'd be happy to use a custom call to BLAS/LAPACK routines for solving triangular linear systems, but I really don't want to write that code myself.
For reference, I'm using scipy version 0.11.0 and the Enthought python distribution (which uses Intel's MKL library for vectorization), so I think I should be able to reach Matlab-like speeds.
TL;DR: Don't use numpy's or scipy's solve when you have a triangular system, just use scipy.linalg.solve_triangular with at least the check_finite=False keyword argument for fast and non-destructive solutions.
I found this thread after stumbling across some discrepancies between numpy.linalg.solve and scipy.linalg.solve (and scipy's lu_solve, etc.). I don't have Enthought's MKL-based Numpy/Scipy, but I hope my findings can help you in some way.
With the pre-built binaries for Numpy and Scipy (32-bit, running on Windows 7):
I see a significant difference between numpy.linalg.solve and scipy.linalg.solve when solving for a vector X (i.e., X is 160 by 1). Scipy runtime is 1.23x numpy's, which is I think substantial.
However, most of the difference appears to be due to scipy's solve checking for invalid entries. When passing check_finite=False into scipy.linalg.solve, scipy's solve runtime is 1.02x numpy's.
Scipy's solve using destructive updates, i.e., with overwrite_a=True, overwrite_b=True is slightly faster than numpy's solve (which is non-destructive). Numpy's solve runtime is 1.021x destructive scipy.linalg.solve. Scipy with just check_finite=False has runtime 1.04x the destructive case. In summary, destructive scipy.linalg.solve is very slightly faster than either of these cases.
The above are for a vector X. If I make X a wide array, specifically 160 by 10000, scipy.linalg.solve with check_finite=False is essentially as fast as with check_finite=False, overwrite_a=True, overwrite_b=True. Scipy's solve (without any special keywords) runtime is 1.09x this "unsafe" (check_finite=False) call. Numpy's solve has runtime 1.03x scipy's fastest for this array X case.
scipy.linalg.solve_triangular delivers significant speedups in both these cases, but you have to turn off input checking, i.e., pass in check_finite=False. The runtime for the fastest solve was 5.68x and 1.76x solve_triangular's, for vector and array X, respectively, with check_finite=False.
solve_triangular with destructive computation (overwrite_b=True) gives you no speedup on top of check_finite=False (and actually hurts slightly for the array X case).
I, ignoramus, was previously unaware of solve_triangular and was using scipy.linalg.lu_solve as a triangular solver, i.e., instead of solve_triangular(cholS, X) doing lu_solve((cholS, numpy.arange(160)), X) (both produce the same answer). But I discovered that lu_solve used in this way has runtime 1.07x unsafe solve_triangular for the vector X case, while its runtime was 1.76x for the array X case. I'm not sure why lu_solve is so much slower for array X, compared to vector X, but the lesson is to use solve_triangular (without infinite checks).
Copying the data to Fortran format didn't seem to matter at all. Neither does converting to numpy.matrix.
I might as well compare my non-MKL Python libraries against single-threaded (maxNumCompThreads=1) Matlab 2013a. The fastest Python implementations above had 4.5x longer runtime for the vector X case and 6.3x longer runtime for the fat matrix X case.
However, here's the Python script I used to benchmark these, perhaps someone with MKL-accelerated Numpy/Scipy can post their numbers. Note that I just comment out the line n = 10000 to disable the fat matrix X case and do the n=1 vector case. (Sorry.)
import scipy.linalg as sla
import numpy.linalg as nla
from numpy.random import RandomState
from timeit import timeit
import numpy as np
RNG = RandomState(69)
m=160
n=1
#n=10000
Ac = RNG.randn(m,m)
if 1:
    Ac = np.triu(Ac)
bc = RNG.randn(m,n)
Af = Ac.copy("F")
bf = bc.copy("F")
if 0: # Save to Matlab format
    import scipy.io as io
    io.savemat("b_%d.mat"%(n,), dict(A=Ac, b=bc))
    import sys
    sys.exit(0)
def lapper(fn, source, **kwargs):
    Alocal = source[0].copy()
    blocal = source[1].copy()
    fn(Alocal, blocal,**kwargs)
laps = (1000 if n<=1 else 100)
def printer(t, s=''):
    print ("%g seconds, %d laps, " % (t/float(laps), laps)) + s
    return t/float(laps)
t=[]
print "C"
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc)), number=laps),
"scipy.solve"))
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc), check_finite=False),
number=laps), "scipy.solve, infinite-ok"))
t.append(printer(timeit(lambda: lapper(nla.solve, (Ac,bc)), number=laps),
"numpy.solve"))
#print "F" # Doesn't seem to matter
#printer(timeit(lambda: lapper(sla.solve, (Af,bf)), number=laps))
#printer(timeit(lambda: lapper(nla.solve, (Af,bf)), number=laps))
print "sla with tweaks"
t.append(printer(timeit(lambda: lapper(sla.solve, (Ac,bc), overwrite_a=True,
overwrite_b=True, check_finite=False),
number=laps), "scipy.solve destructive"))
print "Tri"
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc)),
number=laps), "scipy.solve_triangular"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc),
check_finite=False), number=laps),
"scipy.solve_triangular, inf-ok"))
t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac,bc),
overwrite_b=True, check_finite=False),
number=laps), "scipy.solve_triangular destructive"))
print "LU"
piv = np.arange(m)
t.append(printer(timeit(lambda: lapper(
lambda X,b: sla.lu_solve((X, piv),b,check_finite=False),
(Ac,bc)), number=laps), "LU"))
print "all times:"
print t
Output of the above script for the vector case, n=1:
C
0.000739405 seconds, 1000 laps, scipy.solve
0.000624746 seconds, 1000 laps, scipy.solve, infinite-ok
0.000590003 seconds, 1000 laps, numpy.solve
sla with tweaks
0.000608365 seconds, 1000 laps, scipy.solve destructive
Tri
0.000208711 seconds, 1000 laps, scipy.solve_triangular
9.38371e-05 seconds, 1000 laps, scipy.solve_triangular, inf-ok
9.37682e-05 seconds, 1000 laps, scipy.solve_triangular destructive
LU
0.000100215 seconds, 1000 laps, LU
all times:
[0.0007394047886284343, 0.00062474593940593, 0.0005900030818282472, 0.0006083650710913095, 0.00020871054023307778, 9.383710445114923e-05, 9.37682389063692e-05, 0.00010021534750467032]
Output of the above script for the matrix case n=10000:
C
0.118985 seconds, 100 laps, scipy.solve
0.113687 seconds, 100 laps, scipy.solve, infinite-ok
0.115569 seconds, 100 laps, numpy.solve
sla with tweaks
0.113122 seconds, 100 laps, scipy.solve destructive
Tri
0.0725959 seconds, 100 laps, scipy.solve_triangular
0.0634396 seconds, 100 laps, scipy.solve_triangular, inf-ok
0.0638423 seconds, 100 laps, scipy.solve_triangular destructive
LU
0.1115 seconds, 100 laps, LU
all times:
[0.11898513112988955, 0.11368747217793944, 0.11556863916356903, 0.11312182352918797, 0.07259593807427585, 0.0634396208970783, 0.06384230931663318, 0.11150022257648459]
Note that the above Python script can save its arrays as Matlab .MAT data files. This is currently disabled (if 0, sorry), but if enabled, you can test Matlab's speed on the exact same data. Here's a timing script for Matlab:
clear
q = load('b_10000.mat');
A=q.A;
b=q.b;
clear q
matrix_time = timeit(@() A\b)
q = load('b_1.mat');
A=q.A;
b=q.b;
clear q
vector_time = timeit(@() A\b)
You'll need the timeit function from Mathworks File Exchange: http://www.mathworks.com/matlabcentral/fileexchange/18798-timeit-benchmarking-function. This produces the following output:
matrix_time =
0.0099989
vector_time =
2.2487e-05
The upshot of this empirical analysis is, in Python at least, don't use numpy's or scipy's solve when you have a triangular system, just use scipy.linalg.solve_triangular with at least the check_finite=False keyword argument for fast and non-destructive solutions.
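Applied to the problem in the question, that recommendation might look like the sketch below. The matrix S is a made-up symmetric positive definite matrix and the width M is reduced from 250000, both assumptions chosen so the example runs quickly; lower=True matches the lower Cholesky factor.

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(7)
N, M = 160, 1000                     # M reduced from 250000
A = rng.standard_normal((N, N))
S = A @ A.T + N * np.eye(N)          # symmetric positive definite stand-in
X = rng.standard_normal((N, M))

cholS = sla.cholesky(S, lower=True)  # lower Cholesky factor of S

# Triangular solve that skips the NaN/Inf input scan
Q = sla.solve_triangular(cholS, X, lower=True, check_finite=False)

residual = np.abs(cholS @ Q - X).max()
```

Skipping the finiteness check is only safe when the inputs are known to contain no NaN or Inf values.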
Why not just use the equation: Q = inv(chol(S)) * X, here is my test:
import scipy.linalg
import numpy as np
N = 160
M = 100000
S = np.random.randn(N, N)
B = np.random.randn(N, M)
S = np.dot(S, S.T)
cS = scipy.linalg.cholesky(S, lower=True)
Y1 = scipy.linalg.solve(cS, B)
icS = scipy.linalg.inv(cS)
Y2 = np.dot(icS, B)
np.allclose(Y1, Y2)
output:
True
Here is the time test:
%time scipy.linalg.solve(cS, B)
%time np.linalg.solve(cS, B)
%time scipy.linalg.solve_triangular(cS, B, lower=True)
%time ics=scipy.linalg.inv(cS);np.dot(ics, B)
output:
CPU times: user 2.07 s, sys: 0.00 s, total: 2.07 s
Wall time: 2.08 s
CPU times: user 1.93 s, sys: 0.00 s, total: 1.93 s
Wall time: 1.92 s
CPU times: user 1.12 s, sys: 0.00 s, total: 1.12 s
Wall time: 1.13 s
CPU times: user 0.71 s, sys: 0.00 s, total: 0.71 s
Wall time: 0.72 s
I don't know why scipy.linalg.solve_triangular is slower than numpy.linalg.solve on your system, but the inv version is the fastest.
A couple of things to try:
X = X.copy('F') # use fortran-order arrays, so that a copy is avoided
Y = solve_triangular(cholS, X, overwrite_b=True) # avoid another copy, but trash contents of X
Y = solve_triangular(cholS, X, check_finite=False) # Scipy >= 0.12 only --- but doesn't seem to have a large effect on speed...
With both of these, it should be pretty much equivalent to a direct call to MKL with no buffer copies.
I can't reproduce the issue with np.linalg.solve and scipy.linalg.solve having different speeds --- with the BLAS + LAPACK combination I have, both seem the same speed.