Faster exponentiation of complex arrays in Python using Arrayfire - python

According to the arrayfire pow documentation, af.pow() currently only supports powers (and roots...) of real arrays. No error is thrown, but I found that using af.pow() with complex input can cause a huge memory leak, especially if other functions are used as input (for example, af.pow(af.ifft(array), 2)).
To get around this, I have written the function complexPow below. This seems to run for complex arrays without the memory leak, and a quick comparison showed that my complexPow function returns the same values as numpy.sqrt() and the ** operator, for example.
def complexPow(inData, power):
    for i in af.ParallelRange(inData.shape[0]):
        theta = af.atan(af.imag(inData[i]) / af.real(inData[i]))
        rSquared = af.pow(af.real(inData[i]), 2.0) + \
                   af.pow(af.imag(inData[i]), 2.0)
        r = af.pow(rSquared, 0.5)
        inData[i] = af.pow(r, power) * (af.cos(theta*power) +
                                        1j*af.sin(theta*power))
    return inData
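As a quick sanity check of the polar-form identity used here, the same formula can be reproduced in pure NumPy, independent of arrayfire (a sketch; note that atan(imag/real) only recovers the angle for inputs with positive real part, so the check is restricted to that case):

import numpy as np

# Check r**p * (cos(p*theta) + 1j*sin(p*theta)) against NumPy's ** operator.
z = np.random.rand(8) + 1j * np.random.rand(8)   # Re(z) > 0 by construction
power = 0.5
theta = np.arctan(z.imag / z.real)
r = np.sqrt(z.real**2 + z.imag**2)
approx = r**power * (np.cos(theta * power) + 1j * np.sin(theta * power))
print(np.allclose(approx, z**power))             # True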
Is there a faster way of doing parallel element-wise exponentiation than this? I haven't found one, but I'm scared I'm missing a trick here...

This is a little faster without the parallel for loop:
def complexPow(inData, power):
    theta = af.atan(af.imag(inData) / af.real(inData))
    r = af.pow(af.pow(af.real(inData), 2.0) +
               af.pow(af.imag(inData), 2.0), 0.5)
    inData = af.pow(r, power) * (af.cos(theta*power) +
                                 1j*af.sin(theta*power))
    return inData
Tested for 4000 iterations over a dtype=complex array with dimensions (1, 2**18) using an NVIDIA Quadro K4200, Spyder 3, Python 2.7, Windows 7:
Using af.ParallelRange: 7.64 sec (1.91 msec per iteration).
Method above: 5.94 sec (1.49 msec per iteration).
Speed increase: 28%.
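For reference, a minimal sketch of the kind of timing loop used above (the exact benchmark code was not posted; the array creation and synchronization details here are assumptions):

import time
import arrayfire as af

arr = af.randu(1, 2**18, dtype=af.Dtype.c32)   # random single-precision complex array
af.sync()
t0 = time.time()
for _ in range(4000):
    res = complexPow(arr, 2.0)
    af.eval(res)   # force evaluation of the lazy expression inside the loop
af.sync()          # wait for the device to finish before reading the clock
print('per iteration: %.2f ms' % ((time.time() - t0) / 4000 * 1e3))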


Huge memory requirement difference between JAX 0.2.17 and JAX 0.4.1

Follow-up to the question:
Is it possible to improve python performance for this code?
When using the functions from the accepted answer, with or without jax (bar or jit_bar):
import numpy as np
import jax
import jax.numpy as jnp

T = np.random.rand(5000, 566, 3)

@jax.jit
def jit_bar(Y):
    u, v = jnp.triu_indices(Y.shape[0], 1)
    return jnp.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))

msd = jit_bar(T)
Sending a (10000, 566, 3) array to the function gives me a stable memory usage of 1.5 GB with Python 3.6; with Python 3.8, 3.9, 3.10, or 3.11 the memory skyrockets to over 50 GB.
EDIT:
After some trials it seems to be related to JAX only; this code runs fine with:
python3.6, jax (0.2.17), jaxlib (0.1.68), numpy (1.19.2)
but not with:
python3.11, jax (0.4.1), jaxlib (0.4.1), numpy (1.24.1)
If Y is of shape (10000, 566, 3), then triu_indices(10000, 1) returns arrays of length 10000 × 9999 / 2 = 49,995,000, and so Y[u] and Y[v] are each of size (49995000, 566, 3). If they are float32 values, that is about 316 GB each. I would not expect this code to run well anywhere!
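For reference, a back-of-the-envelope check of that figure:

# rough size estimate of Y[u] (and likewise Y[v]) for N = 10000 frames
N = 10000
pairs = N * (N - 1) // 2            # length of triu_indices(N, 1): 49,995,000
bytes_per_pair = 566 * 3 * 4        # one (566, 3) float32 slice
print(pairs * bytes_per_pair / 2**30, 'GiB')   # ≈ 316 GiB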
I suspect that older JAX versions may have had some additional optimization that was removed in later versions; given the form of your computation, the only candidate is a factorization of the squared difference that avoids instantiating the full pairwise array, which I vaguely recall was previously an XLA optimization but was removed because it's numerically unstable.
But you can do such an optimization manually if you wish; here's an approach that seems to work, and the largest intermediate array it generates for the original inputs is of shape [10000, 10000], roughly 380 MB in float32:
@jax.jit
def jit_bar2(Y):
    u, v = jnp.triu_indices(Y.shape[0], 1)
    Y = Y.reshape(Y.shape[0], -1)
    Y2m = (Y ** 2).mean(-1)
    YYTm = (Y @ Y.T) / Y.shape[1]
    return jnp.sqrt(3 * (Y2m[u] + Y2m[v] - 2 * YYTm[u, v]))
T = np.random.rand(50, 6, 3) # test with a smaller input
np.testing.assert_allclose(jit_bar(T), jit_bar2(T), atol=1E-5)

Integrating Out Dimension from MultiDimensional Array using Parallel Processing

I was hoping to find some clever approaches to solving a parallel-processing problem I've been struggling with. Basically, I am dealing with 20,160 multidimensional arrays with size (72,35,25,20). Currently, I'm integrating out the dimension with size 72 by simply doing a trapezoidal integration in a nested for-loop. My end goal is to get an output array with size (20160,35,25,20).
for idx, filename in enumerate(filenames):
    # Read NetCDF data file as 'raw_data'
    flux = raw_data['FluxHydrogen'][:]   # This is size (72,35,25,20)
    PA = raw_data['PitchAngleGrid'][:]   # This is size (72)
    for i in range(35):
        for j in range(25):
            for k in range(20):
                dir_flux = flux[:, i, j, k]
                omni_flux = np.trapz(dir_flux * np.sin(PA), PA)
                data[idx, i, j, k] = omni_flux   # data will have size (20160,35,25,20)
I believe it would be most beneficial to implement the parallelization lower in the nested for-loop but can't seem to figure out how. I have searched for common questions, but none [that I have found] provide enough insight into how to implement shared memory, pass multidimensional arrays to the pools, and/or reshape the resulting array. Any help or insight would be greatly appreciated.
You can use Numba to speed up this code by a large margin. Numba is a JIT compiler that can compile NumPy-based code to fast native code (so loops are not an issue with it; in fact, it is a good idea to use loops in Numba).
The first thing to do is to pre-compute np.sin(PA) once so as to avoid repeated computations. Then, dir_flux * np.sin(PA) can be computed using a for loop and the result stored in a pre-allocated array so as not to perform millions of expensive small array allocations. The outer loop can be executed by multiple threads using prange and the Numba flag parallel=True. It can be further accelerated using the flag fastmath=True, assuming the input values are not special (like NaN, Inf, or very small subnormal numbers).
While this should theoretically be enough to get fast code, the current implementation of np.trapz is not efficient, as it performs expensive allocations. One can easily rewrite the function so that it does not allocate any additional arrays.
Here is the resulting code:
import numpy as np
import numba as nb

@nb.njit('(float64[::1], float64[::1])')
def trapz(y, x):
    s = 0.0
    for i in range(x.size-1):
        dx = x[i+1] - x[i]
        dy = y[i] + y[i+1]
        s += dx * dy
    return s * 0.5

@nb.njit('(float64[:,:,:,:], float64[:])', parallel=True)
def compute(flux, PA):
    sl, si, sj, sk = flux.shape
    assert sl == PA.size
    data = np.empty((si, sj, sk))
    flattenPA = np.ascontiguousarray(PA)
    sinPA = np.sin(flattenPA)
    for i in nb.prange(si):
        tmp = np.empty(sl)
        for j in range(sj):
            for k in range(sk):
                dir_flux = flux[:, i, j, k]
                for l in range(sl):
                    tmp[l] = dir_flux[l] * sinPA[l]
                omni_flux = trapz(tmp, flattenPA)
                data[i, j, k] = omni_flux
    return data
for idx, filename in enumerate(filenames):
    # Read NetCDF data file as 'raw_data'
    flux = raw_data['FluxHydrogen'][:]   # This is size (72,35,25,20)
    PA = raw_data['PitchAngleGrid'][:]   # This is size (72)
    data[idx] = compute(flux, PA)
Note that flux and PA must be NumPy arrays. Also note that trapz is accurate as long as len(PA) is relatively small and np.std(PA) is not huge. Otherwise a pairwise summation or even a (paranoid) Kahan summation should help (NumPy uses a pairwise summation). In practice, results are the same on random normal numbers.
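For illustration, here is a sketch of what a Kahan-compensated variant of the trapz kernel above could look like (the name trapz_kahan is introduced here; it is only worth considering if PA has many points or a large spread, and it should not be compiled with fastmath=True, which would allow the compensation to be optimized away):

import numba as nb

@nb.njit('(float64[::1], float64[::1])')
def trapz_kahan(y, x):
    s = 0.0   # running sum
    c = 0.0   # running compensation for lost low-order bits
    for i in range(x.size - 1):
        term = (x[i+1] - x[i]) * (y[i] + y[i+1])
        t = term - c
        tot = s + t
        c = (tot - s) - t   # recover what was rounded off
        s = tot
    return s * 0.5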
Further optimizations
The code can be made even faster by making the flux accesses more contiguous. An efficient transposition could be used to do that (NumPy's is not efficient), but this is not simple on 4D arrays. Another solution is to compute the trapz operation on whole lines of the k dimension. This makes the computation very efficient and nearly memory-bound on my machine. Here is the code:
@nb.njit('(float64[:,:,:,:], float64[:])', fastmath=True, parallel=True)
def compute(flux, PA):
    sl, si, sj, sk = flux.shape
    assert sl == PA.size
    data = np.empty((si, sj, sk))
    sinPA = np.sin(PA)
    premultPA = PA * 0.5
    for i in nb.prange(si):
        for j in range(sj):
            dir_flux = flux[:, i, j, :]
            data[i, j, :].fill(0.0)
            for l in range(sl-1):
                dx = premultPA[l+1] - premultPA[l]
                fact1 = dx * sinPA[l]
                fact2 = dx * sinPA[l+1]
                for k in range(sk):
                    data[i, j, k] += fact1 * dir_flux[l, k] + fact2 * dir_flux[l+1, k]
    return data
Note that the premultiplication makes the computation slightly less precise.
Results
Here are results on random numbers (like @DominikStańczak used) on my 6-core machine (i5-9600KF processor):
Initial sequential solution: 193.14 ms (± 1.8 ms)
@DominikStańczak's sequential vectorized solution: 8.68 ms (± 48.1 µs)
Numba parallel solution without fastmath: 0.48 ms (± 6.7 µs)
Numba parallel solution with fastmath: 0.38 ms (± 9.5 µs)
Best Numba solution (with fastmath): 0.32 ms (± 5.2 µs)
Optimal lower-bound execution: 0.24 ms (RAM bandwidth saturation)
Thus, the fastest Numba version is 27 times faster than the (sequential) vectorized version of @DominikStańczak and 604 times faster than the initial one. It is nearly optimal.
As a first step, let's vectorize the code itself. I'm just going to deal with doing this on a per-file basis for now, to show you how to get rid of the nested for loop:
shape = (72, 35, 25, 20)
flux = np.random.normal(size=shape)
PA = np.random.normal(size=shape[0])
Now, timing your implementation, rewritten a little:
%%timeit
data = np.empty(shape[1:])
for i in range(shape[1]):
    for j in range(shape[2]):
        for k in range(shape[3]):
            dir_flux = flux[:, i, j, k]
            omni_flux = np.trapz(dir_flux * np.sin(PA), PA)
            data[i, j, k] = omni_flux
# 211 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
My first idea was to pull sin out of the for loop, because there's no need to recalculate it each time, but that got me 10 ms at most. However, if instead of the for loop we use plain NumPy vectorization via broadcasting, turning sin_PA into a (72, 1, 1, 1)-shaped array:
%%timeit
sin_PA = np.sin(PA).reshape(-1, 1, 1, 1)
data = np.trapz(flux * sin_PA, x=PA, axis=0)
# 9.03 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That's a 20 times speed up, nothing to scoff at. I estimate it'd take about three minutes for all your files. You can also use np.allclose to verify the results agree up to floating point error.
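For instance, something along these lines (assuming data still holds the result of the loop version above):

sin_PA = np.sin(PA).reshape(-1, 1, 1, 1)
data_vec = np.trapz(flux * sin_PA, x=PA, axis=0)
print(np.allclose(data, data_vec))   # should print True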
If you still need to parallelize this afterwards, I would use dask.array. In fact, if you've got your data in netCDF4 files, I would use xarray (which is helpful for multidimensional data anyway) to read them, and then run the trapz computations with Dask enabled in the backend. I think that's the simplest way to achieve easy multiprocessing in this case. Here's a quick sketch:
import xarray
from dask.distributed import Client
client = Client()
file_data = xarray.open_mfdataset(filenames, parallel=True)
# massage the data a little, probably
flux = file_data["FluxHydrogen"]
PA = file_data["PitchAngleGrid"]
integrand = flux * np.sin(PA) # most element-wise numpy operations work on xarray ones or Dask based ones without a hitch
data = integrand.integrate(coord="PitchAngle") # or some such name for the dimension you're integrating out

Numba function takes a long time to assign a value to an array

I wrote a function to calculate the HOG of an image with Numba and ran it on 7000 images; it takes 10 seconds. But when I comment out the line that assigns a value into an array (hist[idx] += mag), the time drops to 5 milliseconds. What is the problem and what should I do about this?
import math
import numpy as np
import numba

@numba.jit(numba.uint64[:](numba.uint8[:,:], numba.uint8), nopython=True)
def hog_numba(img, bins):
    h, w = img.shape
    hist = np.zeros(bins, dtype=np.uint64)
    for i in range(h-1):
        for j in range(w-1):
            cy = img[i-1,j-1]*1 + img[i-1,j]*2 + img[i-1,j+1]*1 + img[i+1,j-1]*-1 + img[i+1,j]*-2 + img[i+1,j+1]*-1
            cx = img[i-1,j-1]*1 + img[i,j-1]*2 + img[i+1,j-1]*1 + img[i-1,j+1]*-1 + img[i,j+1]*-2 + img[i+1,j+1]*-1
            mag = numba.uint32(math.sqrt(math.pow(cx, 2) + math.pow(cy, 2)))
            if cx != 0:
                ang = math.atan2(cy, cx)  # arc tangent
            else:
                if cy > 0:
                    ang = math.pi / 2
                else:
                    ang = -math.pi / 2
            if ang < 0:
                ang = abs(ang) + math.pi
            idx = (ang * bins) // (math.pi * 2)
            idx = int(idx)
            #hist[idx] += mag
    return hist
The code below was used for the benchmark:
for _ in range(20):
    print('start')
    t = time.time()
    hists = []
    for i in range(8000):
        hist = hog_numba(img, 10)
    t = time.time() - t
    print('time:', t)
The difference in speed is not due to the assignment being slow but due to optimizations done by the JIT compiler. Indeed, if you comment out the line hist[idx] += mag, then Numba can see that mag and idx no longer need to be computed and can remove the associated lines. Transitively, it can also remove the computation of ang, cx and cy. Finally, it can fully remove the two nested loops. Such code will be much faster but also useless. However, the JIT may not fully remove all the operations inside the two nested loops in practice, since it may not be able to fully optimize the code due to Python transformations, guards and side effects. On my machine it does optimize the loop to a no-op: it takes less than 1 ms on average to process 8000 images of size (16_000, 16_000), which is totally impossible on my machine (it should be at least 1000 times slower).
Thus, you cannot measure the time of an isolated instruction by just removing it and looking at the time difference with Numba (or any optimizing compiler). Modern compilers are very advanced and trying to defeat them is not easy. If you still want to see whether the cost actually comes mainly from the assignment, you could perform a summation like mag_sum += mag, idx_sum += idx and return/print the summation variables (otherwise the compiler can see they are useless, as they do not cause visible changes). On my machine the assignment version is only 9% slower than an implementation using a summation, showing that the assignment does not take most of the execution time (despite not being very fast, probably due to the random access pattern).
The main source of slowdown comes from the line (ang * bins) // (math.pi * 2), and more specifically from the multiplication/division by a constant. Pre-computing bins / (math.pi * 2) in a temporary variable ahead of time results in a 3.5-times-faster code. The code is still far from optimized; further optimizations include vectorization, branch-less operations and parallelism (using single precision and trying to remove the math.atan2 call may also help).
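For illustration, here is a sketch combining both ideas, the summation trick and a hoisted constant (the names hog_numba_sums and inv_scale are introduced here, and the quadrant handling is simplified via math.atan2; this is not the answer's exact code):

import math
import numpy as np
import numba

# Sketch: keep the per-pixel work observable by returning running sums instead of
# writing into hist, so the JIT cannot dead-code-eliminate it, and hoist the
# constant bins / (2*pi) out of the inner loop.
@numba.njit
def hog_numba_sums(img, bins):
    h, w = img.shape
    inv_scale = bins / (math.pi * 2)   # precomputed constant
    mag_sum = 0.0
    idx_sum = 0
    for i in range(1, h - 1):          # start at 1 to avoid wrapping to the last row/column
        for j in range(1, w - 1):
            cy = (img[i-1, j-1]*1.0 + img[i-1, j]*2.0 + img[i-1, j+1]*1.0
                  - img[i+1, j-1]*1.0 - img[i+1, j]*2.0 - img[i+1, j+1]*1.0)
            cx = (img[i-1, j-1]*1.0 + img[i, j-1]*2.0 + img[i+1, j-1]*1.0
                  - img[i-1, j+1]*1.0 - img[i, j+1]*2.0 - img[i+1, j+1]*1.0)
            mag = math.sqrt(cx*cx + cy*cy)
            ang = math.atan2(cy, cx)   # atan2 already handles cx == 0
            if ang < 0:
                ang = abs(ang) + math.pi
            idx = int(ang * inv_scale)
            mag_sum += mag
            idx_sum += idx
    return mag_sum, idx_sum            # returning the sums keeps the work "visible"

img = np.random.randint(0, 256, (512, 512)).astype(np.uint8)
print(hog_numba_sums(img, 10))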

Matrix multiplication speeds in R as fast as in Python?

I am experiencing substantially slower matrix multiplication in R as compared to python. This is for large matrices. For example (in python):
import numpy as np
A = np.random.rand(4112, 23050).astype('float32')
B = np.random.rand(23050, 2500).astype('float32')
%timeit np.dot(A, B)
1 loops, best of 3: 1.09 s per loop
Here is the equivalent multiplication in R (takes almost 10x longer):
A <- matrix(rnorm(4112*23050), ncol = 23050)
B <- matrix(rnorm(23050*2500), ncol = 2500)
system.time(A %*% B)
user system elapsed
72.032 1.048 9.444
How can I achieve matrix multiplication speeds in R that are comparable to what is standard with python?
What I Have Already Tried:
1) Part of the discrepancy seems to be that python supports float32 whereas R only uses numeric, which is similar to (the same as?) float64. For example, the same python commands as above except with float64 take twice as long (but are still roughly 4x faster than R's elapsed time):
import numpy as np
A = np.random.rand(4112, 23050).astype('float64')
B = np.random.rand(23050, 2500).astype('float64')
%timeit np.dot(A, B)
1 loops, best of 3: 2.24 s per loop
2) I am using the openBLAS linear algebra back-end for R.
3) RcppEigen as detailed in the answer to this SO question (see link for the test.cpp file). The multiplication is about twice as fast in "user" time, but 3x slower in the more important elapsed time, since it uses only 1 of 8 threads.
library(Rcpp)
sourceCpp("test.cpp")
A <- matrix(rnorm(4112*23050), nrow = 4112)
B <- matrix(rnorm(23050*2500), ncol = 2500)
system.time(res <- eigenMatMult(A, B))
user system elapsed
29.436 0.056 29.551
I use MRO and python with anaconda and the MKL BLAS. Here are my results for the same data-generating process, i.e. np.random.rand ('float64') or rnorm, and identical dimensions (average and standard deviation over 10 replications):
Python:
np.dot(A, B) # 1.3616 s (sd = 0.1776)
R:
Bt = t(B)
a = A %*% B # 2.0285 s (sd = 0.1897)
acp = tcrossprod(A, Bt) # 1.3098 s (sd = 0.1206)
identical(acp, a) # TRUE
Slightly tangential, but too long for a comment I think. To check whether the relevant compiler flags (e.g. -fopenmp) are set, use sourceCpp("testeigen.cpp",verbose=TRUE).
On my system, this showed that the OpenMP flags are not defined by default.
I did this to enable them (adapted from here):
library(Rcpp)
pkglibs <- "-fopenmp -lgomp"
pkgcxxflags <- "-fopenmp"
Sys.setenv(PKG_LIBS=pkglibs,PKG_CXXFLAGS=pkgcxxflags)
sourceCpp("testeigen.cpp",verbose=TRUE)
Dirk Eddelbuettel comments that he prefers to set the compiler flags in ~/.R/Makevars.
The example I took this from called the internal Rcpp:::RcppLdFlags and Rcpp:::RcppCxxFlags functions and prepended the results to the flags given above; this seems not to be necessary (?)

Accelerating scientific python program

I have the following code in python:
def P(z, u0):
    x = np.inner(z, u0)
    tmp = x*u0
    return (z - tmp)

def powerA2(A, u0):
    x0 = np.random.rand(len(A))
    for i in range(ITERATIONS):
        x0 = P(np.dot(A, x0), u0)
        x0 = x0 / np.linalg.norm(x0)
    return (np.inner(np.dot(A, x0), x0))
np is the numpy package.
I am interested in running this code for matrices of size 100,000 × 100,000, but it seems there is no chance of this program running fast (I need to run it many times, about 10,000).
Is there any chance that tricks like multi-threading would work here?
Does anything else help to accelerate it?
You could consider using Pythran. Compiling the following code (norm.py):
#pythran export powerA2(float [][], float[])
import numpy as np

def P(z, u0):
    x = np.inner(z, u0)
    tmp = x*u0
    return (z - tmp)

def norm(x):
    return np.sqrt(np.sum(np.abs(x)**2))

def powerA2(A, u0):
    ITERATIONS = 100
    x0 = np.random.random(len(A))
    for i in range(ITERATIONS):
        x0 = P(np.dot(A, x0), u0)
        x0 = x0 / norm(x0)
    return (np.inner(np.dot(A, x0), x0))
with:
pythran norm.py
yields the following speedup:
$ python -m timeit -s 'import numpy as np; A = np.random.rand(100, 100); B = np.random.random(100); import norm' 'norm.powerA2(A, B)'
100 loops, best of 3: 3.1 msec per loop
$ pythran norm.py -O3 -march=native
$ python -m timeit -s 'import numpy as np; A = np.random.rand(100, 100); B = np.random.random(100); import norm' 'norm.powerA2(A, B)'
1000 loops, best of 3: 937 usec per loop
Just to check: you want to do 10^4 runs of something that touches 10^10 matrix entries... so even if each per-entry operation is O(1), that's still 10^14 operations, which is a pretty hard problem (and, as haraldkl pointed out in his comment, this also eats a ton of memory). Also, to clarify: are you going to call powerA2 10,000 times, or is 10,000 your desired value for ITERATIONS? If the former, you could use threads (or better yet, separate processes) to get some parallelization, but I don't know if that will be enough; if the latter, unless there's a trick I'm missing, your inputs don't seem parallelizable, since the input of each loop iteration depends on the output of the previous one. There may be a way to do this on a GPU (I would like to think there'd be an efficient way to at least do the normalization bit so that it can process large amounts of data quickly using vectorization).
Edit in response to comment: CPython (the most common Python implementation) has a Global Interpreter Lock (GIL); some other Python implementations (Jython, IronPython) do not. Per https://wiki.python.org/moin/GlobalInterpreterLock:
Note that potentially blocking or long-running operations, such as
I/O, image processing, and NumPy number crunching, happen outside the
GIL. Therefore it is only in multithreaded programs that spend a lot
of time inside the GIL, interpreting CPython bytecode, that the GIL
becomes a bottleneck.
As far as I know, it should be possible to use threads with numpy and not be horribly bottlenecked but your problem still looks hard to convert to threads unless there's some bit of math I'm missing.
I get a 10% improvement over the uncompiled serge-sans-paille version by redefining the functions this way:
def P0(z, u0):
    x = np.inner(z, u0)
    x *= u0
    return (z - x)

def norm0(x):
    return np.sqrt(np.sum(x*x))

def powerA20(A, u0):
    ITERATIONS = 100
    x0 = np.random.random(len(A))
    for i in range(ITERATIONS):
        x0 = P0(np.dot(A, x0), u0)
        x0 /= norm0(x0)
    return (np.inner(np.dot(A, x0), x0))
Doing things like *= u0 instead of x = x*u0 avoids unnecessary copies of the variables in RAM, speeding the program up a little bit.
Also, you don't need abs in that case. And finally, x*x is slightly faster than x**2.
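A rough way to see the allocation effect in isolation (an illustrative sketch; buf is a name introduced here for a pre-allocated output buffer):

import timeit
import numpy as np

x = np.random.rand(1000000)
u0 = np.random.rand(1000000)
buf = np.empty_like(x)

# fresh temporary allocated on every call vs. writing into a pre-allocated buffer
print(timeit.timeit('x * u0', globals=globals(), number=1000))
print(timeit.timeit('np.multiply(x, u0, out=buf)', globals=globals(), number=1000))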
