Python numpy.fft changes strides

Dear stackoverflow community!
Today I found that on a high-end cluster architecture, an elementwise multiplication of 2 cubes with dimensions 1921 x 512 x 512 takes ~27 s. This is much too long, since I have to perform such computations at least 256 times for an azimuthal averaging of a power spectrum in the current implementation.
I found that the slow performance was mainly due to different stride structures (C in one case and FORTRAN in the other). One of the two arrays was a newly generated boolean grid (C order) and the other one (FORTRAN order) came from the 3D numpy.fft.fftn() Fourier transform of an input grid (C order). Any reasons why numpy.fft.fftn() changes the strides, and ideas on how to prevent that other than reversing the axes (which would be just a workaround)? With similar strides (ndarray.copy() of the FT grid), ~4 s are achievable, a tremendous improvement.
The question is therefore as follows:
Consider the array:
ran = np.random.rand(1921, 512, 512)
ran.strides
(2097152, 4096, 8)
a = np.fft.fftn(ran)
a.strides
(16, 30736, 15736832)
As we can see, the stride structure is different. How can this be prevented (without using a = np.fft.fftn(ran, axes=(1, 0)))? Are there any other numpy array routines that could affect the stride structure? What can be done in those cases?
Helpful advice is as usual much appreciated!

You could use scipy.fftpack.fftn (as suggested by hpaulj too) rather than numpy.fft.fftn; it looks like it does what you want. It is, however, slightly slower:
import numpy as np
import scipy.fftpack
ran = np.random.rand(192, 51, 51) # not much memory on my laptop
a = np.fft.fftn(ran)
b = scipy.fftpack.fftn(ran)
ran.strides
(20808, 408, 8)
a.strides
(16, 3072, 156672)
b.strides
(41616, 816, 16)
%timeit -n 100 np.fft.fftn(ran)
100 loops, best of 3: 37.3 ms per loop
%timeit -n 100 scipy.fftpack.fftn(ran)
100 loops, best of 3: 41.3 ms per loop
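If you prefer to stay with numpy.fft.fftn, a simple (if memory-hungry) workaround, essentially the ndarray.copy() trick mentioned in the question, is to copy the transform back into C order before the elementwise multiplication. A minimal sketch:
import numpy as np

ran = np.random.rand(192, 51, 51)
a = np.fft.fftn(ran)
a_c = np.ascontiguousarray(a)   # one extra copy of the complex cube, but C-ordered
print(a_c.strides)              # strides now follow the row-major layout of ran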

Any reasons why numpy.fft.fftn() changes the strides and ideas on how to prevent that except for reversing the axes (which would be just a workaround)?
Computing the multidimensional DFT of an array consists of successively computing 1D DFTs over each dimension. There are two strategies:
Restrict 1D DFT computations to contiguous 1D arrays. As the array is contiguous, problems related to latency/cache misses are reduced. This strategy has a major drawback: the array has to be transposed between each dimension. It is likely the strategy adopted by numpy.fft. At the end of the computation, the array has been transposed. To avoid unnecessary computations, the transposed array is returned and the strides are modified.
Enable 1D DFT computations for strided arrays. This might trigger some problems related to latency. It is the strategy of FFTW, available through the interface pyfftw. As a result, the output array features the same strides as the input array.
Timing numpy.fft.fftn and pyfftw.interfaces.numpy_fft.fftn, as performed here and there or there, will tell you whether FFTW is really the Fastest Fourier Transform in the West or not...
To check that numpy uses the first strategy, take a look at numpy/fft/fftpack.py. At lines 81-85, the call to work_function(a, wsave) (i.e. fftpack.cfftf, from FFTPACK, arguments documented there) is enclosed between calls to numpy.swapaxes() performing the transpositions.
scipy.fftpack.fftn does not seem to change the strides... Nevertheless, it seems that it makes use of the first strategy. scipy.fftpack.fftn() calls scipy.fftpack.zfftnd(), which calls zfft(), based on zfftf1, which does not seem to handle strided DFTs. Moreover, zfftnd() calls the function flatten() many times, which performs the transposition.
Another example: for parallel distributed-memory multidimensional DFTs, FFTW-MPI uses the first strategy to avoid any MPI communication between processes during the 1D DFTs. Of course, functions to transpose the array are not far away and a lot of MPI communication is involved in the process.
Are there any other numpy array routines that could affect stride structure? What can be done in those cases?
You can search the GitHub repository of numpy for swapaxes: this function is only used a couple of times. Hence, to my mind, this "change of strides" is particular to fft.fftn() and most numpy functions keep the strides unchanged.
Finally, the "change of strides" is a feature of the first strategy and there is no way to prevent it. The only workaround is to swap the axes at the end of the computation, which is costly. But you can rely on pyfftw, since FFTW implements the second strategy in a very efficient way. The DFT computations will be faster, and subsequent computations will also be faster if the strides of the different arrays become consistent.

Related

Fast Savgol Filter on 3D Tensor

I have a tensor of example shape (543, 133, 3), meaning 543 frames, with 133 points of X,Y,Z
I would like to run a savgol_filter on every point in every dimension, however, naively, this is quite slow:
points, frames, dims = tensor.shape
new_data = []
for point in range(points):
    new_dims = []
    for dim in range(dims):
        new_dims.append(scipy.signal.savgol_filter(data[point, :, dim], 3, 1))
    new_data.append(new_dims)
tensor = np.array(new_data)
On my computer, for this small tensor, this takes 300ms, which is quite a long time.
Is there a way to make this faster?
This is by no means the fastest method, but it should be quite a lot faster than what you're currently doing. We can utilize vectorized operations instead of for loops to achieve much better performance.
From your code, it seems like you want to smooth along the 133-length dimension (axis 1), so you could apply the Savitzky-Golay filter all at once with
savgol_filter(data, 3, 1, axis=1). In general, you can specify the axis on which you'd like to apply the filter. On my computer, this brought the computation from 500 ms to 2 ms.
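A small sketch of that single vectorized call, using a random stand-in for the (543, 133, 3) tensor from the question:
import numpy as np
from scipy.signal import savgol_filter

tensor = np.random.rand(543, 133, 3)
smoothed = savgol_filter(tensor, 3, 1, axis=1)   # one call replaces both Python loops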
A side note: Since you care about performance, I would pay attention to what your data order is. Depending on what you're doing, it might be advisable to reorder your data once to save time.
For example: let's say you have a matrix of 5 signals (5x299). If you wanted to get a single signal, that's easy! Try signal[0]. This doesn't actually require copying the data and we can just "view" it in memory. But what if you wanted to select a particular band across the signals? If you do signal[:,0], then you can't get a contiguous "view" of the memory, because first you need to access every signal and take that index. If you had first transposed the matrix, then the first index is just the band of every spectrum that you want -- no need for iteration. Data order can be an important part of getting the best performance out of your computations.
There are two related concepts here: contiguous memory and vectorized operations. My explanation of why data order is important has some more complications, and you will need to do your own research to determine what data ordering will give you the best performance for your application. The big things to watch out for are C vs. Fortran contiguous memory layouts.
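A small illustration of the contiguity point, using the 5x299 example above:
import numpy as np

signal = np.random.rand(5, 299)
print(signal[0].flags['C_CONTIGUOUS'])       # True: one signal is a contiguous slice
print(signal[:, 0].flags['C_CONTIGUOUS'])    # False: one "band" strides across every signal
band_first = np.ascontiguousarray(signal.T)  # one-off transpose + copy
print(band_first[0].flags['C_CONTIGUOUS'])   # True: now each band is contiguous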
Here are some resources I found: (not an endorsement)
StackOverflow article on contiguous memory: What is the difference between contiguous and non-contiguous arrays?
Towards Data Science article on vectorized operations https://towardsdatascience.com/vectorization-must-know-technique-to-speed-up-operations-100x-faster-50b6e89ddd45

Faster numpy array indexing when using condition (numpy.where)?

I have a huge numpy array with shape (50000000, 3) and I'm using:
x = array[np.where((array[:,0] == value) | (array[:,1] == value))]
to get the part of the array that I want. But this way seems to be quite slow.
Is there a more efficient way of performing the same task with numpy?
np.where is highly optimized and I doubt someone can write faster code than the one implemented in the latest Numpy version (disclaimer: I was the one who optimized it). That being said, the main issue here is not so much np.where but the condition, which creates a temporary boolean array. This is unfortunately the way to do it in Numpy and there is not much you can do as long as you use only Numpy with the same input layout.
One reason it is not very efficient is that the input data layout is inefficient. Indeed, assuming array is contiguously stored in memory using the default row-major ordering, array[:,0] == value reads 1 item out of every 3 items of the array in memory. Due to the way CPU caches work (i.e. cache lines, prefetching, etc.), 2/3 of the memory bandwidth is wasted. In fact, the output boolean array also needs to be written, and filling a newly created array is a bit slow due to page faults. Note that array[:,1] == value will certainly reload data from RAM due to the size of the input (which cannot fit in most CPU caches). RAM is slow and it is getting slower relative to the computational speed of the CPU and caches. This problem, called the "memory wall", was observed a few decades ago and is not expected to be fixed any time soon. Also note that the logical-or creates yet another array that is read from/written to RAM. A better data layout is a (3, 50000000) transposed array contiguous in memory (note that np.transpose does not produce a contiguous array).
Another reason explaining the performance issue is that Numpy tends not to be optimized to operate on very small axes.
One main solution is to create the input in a transposed layout if possible. Another solution is to write Numba or Cython code. Here is an implementation for the non-transposed input:
import numpy as np
import numba as nb

# Compilation for the most frequent types.
# Please pick the right ones so as to speed up the compilation time.
@nb.njit(['(uint8[:,::1],uint8)', '(int32[:,::1],int32)', '(int64[:,::1],int64)', '(float64[:,::1],float64)'], parallel=True)
def select(array, value):
    n = array.shape[0]
    mask = np.empty(n, dtype=np.bool_)
    for i in nb.prange(n):
        mask[i] = array[i, 0] == value or array[i, 1] == value
    return mask

x = array[select(array, value)]
Note that I used a parallel implementation since the or operator is sub-optimal with Numba (the only solution seems to be native code or Cython) and also because the RAM cannot be fully saturated with one thread on some platforms like computing servers. Also note that it can be faster to use array[np.where(select(array, value))[0]] depending on the result of select. Indeed, if the matching entries are random or very few, then np.where can be faster since it has special optimizations for these cases that boolean indexing does not perform. Note that np.where is not particularly optimized in the context of a Numba function, since Numba uses its own implementation of Numpy functions and they are sometimes not as optimized for large arrays. A faster implementation consists in creating x in parallel, but this is not trivial to do with Numba since the number of output items is not known ahead of time and threads must know where to write data, not to mention Numpy is already fairly fast at doing that sequentially as long as the output is predictable.
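A hedged sketch of the other suggestion, storing the data transposed so each condition scans contiguous memory (the data here is made up):
import numpy as np

n = 5_000_000
array_t = np.random.randint(0, 1000, size=(3, n), dtype=np.int64)   # (3, n), C-contiguous
value = 42
mask = (array_t[0] == value) | (array_t[1] == value)                # each comparison reads a contiguous row
x = array_t[:, mask].T                                               # back to the (n_selected, 3) shape of the question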

Iterative algorithms in NumPy by abusing as_strided

I was wondering if it is possible to write an iterative algorithm without using a for loop using as_strided and some operation that edits the memory in place.
For example, say I want to write an algorithm that replaces a number in an array with the sum of its neighbors. I came up with this abomination (yep, it's summing an element with its 2 right neighbors, but it's just to get the idea):
import numpy as np
a = np.arange(10)
ops = 2
a_view_window = np.lib.stride_tricks.as_strided(a, shape = (ops,a.size - 2, 3), strides=(0,) + 2*a.strides)
a_view = np.lib.stride_tricks.as_strided(a, shape = (ops,a.size - 2), strides=(0,) + a.strides)
np.add.reduce(a_view_window, axis = -1, out=a_view)
print(a)
So I am taking an array of 10 numbers and creating this strange view which increases dimensionality without changing the strides. Thus my thinking is that the reduction will run over the fake new dimension and write over the previous values, so when it gets to the next major dimension it will have to read from the data it overwrote and thus iteratively perform the addition.
Sadly this does not work :(
(yes I know this is a terrible way to do things but I am curious about how the underlying numpy stuff works and if it can be hacked in this way)
This code results in undefined behavior prior to Numpy 1.13 and works out-of-place in newer versions so as to avoid overlapping/aliasing issues. Indeed, you cannot assume Numpy iterates in a given order over the input/output array views. In fact, Numpy often uses SIMD instructions to speed up the code and sometimes tells compilers that views do not overlap/alias each other (using the restrict keyword) so they can generate much more efficient code. For more information you can read the doc on ufuncs (and this issue):
Operations where ufunc input and output operands have memory overlap produced undefined results in previous NumPy versions, due to data dependency issues. In NumPy 1.13.0, results from such operations are now defined to be the same as for equivalent operations where there is no memory overlap.
Operations affected now make temporary copies, as needed to eliminate data dependency. As detecting these cases is computationally expensive, a heuristic is used, which may in rare cases result in needless temporary copies. For operations where the data dependency is simple enough for the heuristic to analyze, temporary copies will not be made even if the arrays overlap, if it can be deduced copies are not necessary.
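For completeness, a plain-loop sketch (my own, not from the answer) of the order-dependent update the question was after; an explicit loop is the only way to get a defined evaluation order:
import numpy as np

a = np.arange(10)
ops = 2
for _ in range(ops):                       # plays the role of the "fake" leading dimension
    for i in range(a.size - 2):
        a[i] = a[i] + a[i + 1] + a[i + 2]  # the second pass reads the values written by the first
print(a)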

FFT Speed on Noncubic meshes

I need to repeatedly take the Fourier Transform/Inverse Fourier Transform of a 3d function in order to solve a differential equation. Something like:
import pyfftw.interfaces.numpy_fft as fftw
for i in range(largeNumber):
    fFS = fftw.rfftn(f)
    # Do stuff
    f = fftw.irfftn(fFS)
The shape of f is highly noncubic. Is there any performance difference based on the order of dimensions, for example (512, 32, 128) vs (512, 128, 32), etc.?
I am looking for any speed ups available. I have already tried playing around with wisdom. I thought it might be fastest if the largest dimension went last (e.g. 32, 128, 512) so that fFS.shape = (32, 128, 257), but this doesn't appear to be the case.
If you really want to squeeze all the performance out you can, use the FFTW object directly (most easily accessed through pyfftw.builders). This way you get careful control over exactly what copies occur and whether the normalization is performed on the inverse.
Your code as-is will likely benefit from using the cache (enabled by calling pyfftw.interfaces.cache.enable()), which minimises the set up time for the general and safe case, though doesn't eliminate it.
Regarding the best arrangement of dimensions, you'll have to suck it and see. Try all the various options and see what is fastest (with timeit). Make sure when you do the tests you're actually using the data arranged in memory as expected and not just taking a view of the same array in memory (which pyfftw may well handle fine without a copy - though there are tweak parameters for this sort of thing).
FFTW tries lots of different options (different algorithms over different FFT representations) and picks the fastest, so you end up with non-obvious implementations that may well change for different datasets that are superficially very similar.
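A rough sketch of the builders approach (pyfftw assumed installed; the shape and thread count below are placeholders, not a recommendation):
import numpy as np
import pyfftw

f = pyfftw.empty_aligned((512, 32, 128), dtype='float64')
f[:] = np.random.rand(512, 32, 128)

rfftn_obj = pyfftw.builders.rfftn(f, threads=4)                         # plan once, reuse many times
irfftn_obj = pyfftw.builders.irfftn(rfftn_obj(), s=f.shape, threads=4)

for i in range(10):                                                     # stand-in for largeNumber
    fFS = rfftn_obj(f)
    # Do stuff with fFS
    f = irfftn_obj(fFS)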
General tips:
Turn on the multi-threading for maximum performance (set threads=N where appropriate).
Make sure your arrays are suitably byte aligned - this has less impact than it used to with modern hardware, but will probably make a difference (particularly if all your higher dimension sizes have the byte alignment as a factor).
Read the tutorial and the api docs.

Techniques for working with large Numpy arrays? [duplicate]

This question already has answers here:
Very large matrices using Python and NumPy
(11 answers)
There are times when you have to perform many intermediate operations on one, or more, large Numpy arrays. This can quickly result in MemoryErrors. In my research so far, I have found that pickling (pickle, cPickle, PyTables, etc.) and gc.collect() are ways to mitigate this. I was wondering if there are any other techniques experienced programmers use when dealing with large quantities of data (other than removing redundancies in your strategy/code, of course).
Also, if there's one thing I'm sure of is that nothing is free. With some of these techniques, what are the trade-offs (i.e., speed, robustness, etc.)?
I feel your pain... You sometimes end up storing several times the size of your array in values you will later discard. When processing one item in your array at a time, this is irrelevant, but can kill you when vectorizing.
I'll use an example from work for illustration purposes. I recently coded the algorithm described here using numpy. It is a color map algorithm, which takes an RGB image, and converts it into a CMYK image. The process, which is repeated for every pixel, is as follows:
Use the most significant 4 bits of every RGB value, as indices into a three-dimensional look up table. This determines the CMYK values for the 8 vertices of a cube within the LUT.
Use the least significant 4 bits of every RGB value to interpolate within that cube, based on the vertex values from the previous step. The most efficient way of doing this requires computing 16 arrays of uint8s the size of the image being processed. For a 24-bit RGB image, that is equivalent to needing storage 6x the size of the image in order to process it.
A couple of things you can do to handle this:
1. Divide and conquer
Maybe you cannot process a 1,000x1,000 array in a single pass. But if you can do it with a python for loop iterating over 10 arrays of 100x1,000, it is still going to beat, by a very far margin, a python iterator over 1,000,000 items! It's going to be slower, yes, but not by much.
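As an illustration of the idea (the function and the operation inside it are made up; only the blocking pattern matters):
import numpy as np

def process_in_blocks(a, block_rows=100):
    out = np.empty_like(a)
    for start in range(0, a.shape[0], block_rows):
        stop = start + block_rows
        block = a[start:stop]
        # intermediate temporaries now only span block_rows rows at a time
        out[start:stop] = np.sqrt(block**2 + 1.0)
    return out

result = process_in_blocks(np.random.rand(1000, 1000))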
2. Cache expensive computations
This relates directly to my interpolation example above, and is harder to come across, although it is worth keeping an eye open for. Because I am interpolating on a three-dimensional cube with 4 bits in each dimension, there are only 16x16x16 possible outcomes, which can be stored in 16 arrays of 16x16x16 bytes. So I can precompute them and store them using 64 KB of memory, and look up the values one by one for the whole image, rather than redoing the same operations for every pixel at huge memory cost. This already pays off for images as small as 64x64 pixels, and basically allows processing images with 6x the number of pixels without having to subdivide the array.
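An illustrative sketch of the precompute-then-look-up pattern (the table and operation are invented, not the CMYK algorithm from the answer):
import numpy as np

levels = np.arange(16, dtype=np.float64)
# a 16x16x16 table of some expensive per-combination result, computed once
lut = np.sqrt(levels[:, None, None] * levels[None, :, None] + levels[None, None, :])
img4bit = np.random.randint(0, 16, size=(64, 64, 3))              # e.g. the high nibbles of an RGB image
result = lut[img4bit[..., 0], img4bit[..., 1], img4bit[..., 2]]   # one fancy-index lookup for the whole image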
3. Use your dtypes wisely
If your intermediate values can fit in a single uint8, don't use an array of int32s! This can turn into a nightmare of mysterious errors due to silent overflows, but if you are careful, it can provide a big saving of resources.
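A tiny, hedged example of how much the dtype choice matters, and of the overflow to watch for:
import numpy as np

small = np.arange(100, dtype=np.uint8)    # 100 bytes
large = np.arange(100, dtype=np.int32)    # 400 bytes for the same values
print(small.nbytes, large.nbytes)
print(np.array([200], np.uint8) + np.array([100], np.uint8))   # [44]: silent wrap-around, the "nightmare" mentioned above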
First most important trick: allocate a few big arrays, and use and recycle portions of them, instead of bringing into life and discarding/garbage collecting lots of temporary arrays (a minimal sketch follows below). It sounds a little bit old-fashioned, but with careful programming the speed-up can be impressive. (You have better control of alignment and data locality, so numeric code can be made more efficient.)
Second: use numpy.memmap and hope that OS caching of accesses to the disk are efficient enough.
Third: as pointed out by @Jaime, work on block sub-matrices if the whole matrix is too big.
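A minimal sketch of the first trick, reusing one preallocated buffer via the out= argument instead of creating fresh temporaries every iteration:
import numpy as np

a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)
buf = np.empty_like(a)              # allocated once, recycled every iteration
for _ in range(100):
    np.multiply(a, b, out=buf)      # no new temporary array
    np.add(buf, a, out=buf)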
EDIT:
Avoid unnecessary list comprehensions, as pointed out in this answer on SE.
The dask.array library provides a numpy interface that uses blocked algorithms to handle larger-than-memory arrays with multiple cores.
You could also look into Spartan, Distarray, and Biggus.
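A hedged dask.array sketch (shapes and chunking are arbitrary); the full array never has to fit in memory, and the chunks are processed block by block:
import dask.array as da

x = da.random.random((50_000, 50_000), chunks=(5_000, 5_000))   # ~20 GB if materialised at once
total = (x ** 2).sum().compute()                                 # computed chunk by chunk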
If it is possible for you, use numexpr. For numeric calculations like a**2 + b**2 + 2*a*b (with a and b being arrays), it:
compiles machine code that executes fast and with minimal memory overhead, taking care of memory locality (and thus cache optimization) if the same array occurs several times in your expression,
uses all cores of your dual- or quad-core CPU,
and is an extension to numpy, not an alternative.
For medium and large sized arrays, it is faster than numpy alone.
Take a look at the web page given above; there are examples that will help you understand whether numexpr is for you.
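A short, hedged numexpr sketch (the arrays are placeholders); the whole expression is evaluated in one blocked, multi-threaded pass without large temporaries:
import numpy as np
import numexpr as ne

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
result = ne.evaluate("a**2 + b**2 + 2*a*b")   # picks up a and b from the local scope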
On top of everything said in the other answers: when we do want to store all the intermediate results of a computation (we don't always need to keep them in memory, but sometimes we do), we can also use accumulate from numpy alongside the various types of aggregations:
Aggregates
For binary ufuncs, there are some interesting aggregates that can be computed directly from the object. For example, if we'd like to reduce an array with a particular operation, we can use the reduce method of any ufunc. A reduce repeatedly applies a given operation to the elements of an array until only a single result remains.
For example, calling reduce on the add ufunc returns the sum of all elements in the array:
x = np.arange(1, 6)
np.add.reduce(x) # Outputs 15
Similarly, calling reduce on the multiply ufunc results in the product of all array elements:
np.multiply.reduce(x) # Outputs 120
Accumulate
If we'd like to store all the intermediate results of the computation, we can instead use accumulate:
np.add.accumulate(x) # Outputs array([ 1, 3, 6, 10, 15], dtype=int32)
np.multiply.accumulate(x) # Outputs array([ 1, 2, 6, 24, 120], dtype=int32)
Using these numpy operations wisely while performing many intermediate operations on one, or more, large Numpy arrays can give you great results without the use of any additional libraries.
