I need to repeatedly take the Fourier Transform/Inverse Fourier Transform of a 3d function in order to solve a differential equation. Something like:
import pyfftw.interfaces.numpy_fft as fftw
for i in range(largeNumber):
    fFS = fftw.rfftn(f)
    # Do stuff
    f = fftw.irfftn(fFS)
The shape of f is highly noncubic. Is there any performance difference based on the order of dimensions, for example (512, 32, 128) vs (512, 128, 32), etc.?
I am looking for any speed ups available. I have already tried playing around with wisdom. I thought it might be fastest if the largest dimension went last (e.g. 32, 128, 512) so that fFS.shape = (32, 128, 257), but this doesn't appear to be the case.
If you really want to squeeze out all the performance you can, use the FFTW object directly (most easily accessed through pyfftw.builders). That way you get careful control over exactly which copies occur and over whether the normalization is performed on the inverse transform (see the sketch after the tips below).
Your code as-is will likely benefit from using the cache (enabled by calling pyfftw.interfaces.cache.enable()), which minimises the set up time for the general and safe case, though doesn't eliminate it.
Regarding the best arrangement of dimensions, you'll have to suck it and see. Try all the various options and see what is fastest (with timeit). Make sure when you do the tests you're actually using the data arranged in memory as expected and not just taking a view of the same array in memory (which pyfftw may well handle fine without a copy - though there are tweak parameters for this sort of thing).
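A hedged sketch of such a timing loop (the shapes, the repeat count and the cache call are placeholders; adapt them to your real data):

import timeit
import numpy as np
import pyfftw.interfaces.cache
import pyfftw.interfaces.numpy_fft as fftw

pyfftw.interfaces.cache.enable()   # avoid re-planning on every call

# Force each candidate layout into real C-contiguous memory before timing,
# so you measure the layout itself and not a strided view of the same data.
for shape in [(512, 32, 128), (512, 128, 32), (32, 128, 512)]:
    f = np.ascontiguousarray(np.random.rand(*shape))
    t = timeit.timeit(lambda: fftw.irfftn(fftw.rfftn(f)), number=10)
    print(shape, t)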
FFTW tries lots of different options (different algorithms over different FFT representations) and picks the fastest, so you end up with non-obvious implementations that may well change for different datasets that are superficially very similar.
General tips:
Turn on the multi-threading for maximum performance (set threads=N where appropriate).
Make sure your arrays are suitably byte aligned - this has less impact than it used to with modern hardware, but will probably make a difference (particularly if all your higher dimension sizes have the byte alignment as a factor).
Read the tutorial and the api docs.
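A minimal sketch of the builders approach mentioned above (the shape, threads=4 and largeNumber are placeholder values; pyfftw.empty_aligned is available in recent pyfftw versions):

import numpy as np
import pyfftw

largeNumber = 1000                              # placeholder iteration count

# Build the FFTW plans once, outside the loop, on byte-aligned arrays.
f = pyfftw.empty_aligned((512, 128, 32), dtype='float64')
f[:] = np.random.rand(*f.shape)                 # placeholder data

rfft = pyfftw.builders.rfftn(f, threads=4)
fFS = rfft()                                    # first forward transform
irfft = pyfftw.builders.irfftn(fFS, s=f.shape, threads=4)

for i in range(largeNumber):
    fFS = rfft(f)      # forward transform, reusing the prepared plan
    # Do stuff with fFS
    f = irfft(fFS)     # inverse transform; normalization is applied by default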
Related
I have a tensor with example shape (543, 133, 3), meaning 543 frames, each with 133 points of X, Y, Z coordinates.
I would like to run a savgol_filter on every point in every dimension; done naively, however, this is quite slow:
import numpy as np
import scipy.signal

points, frames, dims = tensor.shape
new_data = []
for point in range(points):
    new_dims = []
    for dim in range(dims):
        new_dims.append(scipy.signal.savgol_filter(tensor[point, :, dim], 3, 1))
    new_data.append(new_dims)
tensor = np.array(new_data)
On my computer, for this small tensor, this takes 300ms, which is quite a long time.
Is there a way to make this faster?
This is by no means the fastest method, but it should be quite a lot faster than what you're currently doing. We can utilize vectorized operations instead of for loops to achieve much better performance.
From your code, it seems like you want to smooth along the 133-point dimension (axis 1), so you can apply the Savitzky-Golay filter all at once with savgol_filter(tensor, 3, 1, axis=1). In general you can specify the axis along which the filter is applied. On my computer, this brought the computation from 500 ms down to 2 ms.
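A minimal sketch of the vectorized call (random data stands in for the real tensor):

import numpy as np
from scipy.signal import savgol_filter

tensor = np.random.rand(543, 133, 3)   # stand-in for the real data

# One call filters along axis 1 for every frame and coordinate at once,
# replacing the nested Python loops.
smoothed = savgol_filter(tensor, window_length=3, polyorder=1, axis=1)
print(smoothed.shape)                  # (543, 133, 3), same layout as the input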
A side note: Since you care about performance, I would pay attention to what your data order is. Depending on what you're doing, it might be advisable to reorder your data once to save time.
For example: let's say you have a matrix of 5 signals (5x299, C-ordered). If you want a single signal, that's easy: signal[0] is just a "view" of memory that is already contiguous, with no copying required. But what if you want a particular band across all the signals? signal[:, 0] is still a view, but a strided, non-contiguous one: to gather it you have to skip through every signal and pick out that one index. If you had first transposed (and copied) the matrix, the first index would directly give you that band of every signal as one contiguous block -- no skipping around needed. Data order can be an important part of getting the best performance out of your computations.
There are two related concepts here: contiguous memory and vectorized operations. My explanation of why data order matters glosses over some complications, and you will need to do your own research to determine which data ordering gives the best performance for your application. The big things to watch out for are C versus Fortran contiguous memory layouts.
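A small illustration of the view/contiguity point (array sizes are arbitrary):

import numpy as np

signal = np.random.rand(5, 299)   # 5 signals of length 299, C-ordered

row = signal[0]       # contiguous view: the whole row sits in one memory block
col = signal[:, 0]    # still a view, but strided: elements are a full row apart
print(row.flags['C_CONTIGUOUS'], col.flags['C_CONTIGUOUS'])   # True False

# If you will work band-by-band a lot, transpose and copy once up front so
# each band becomes a contiguous row afterwards.
by_band = np.ascontiguousarray(signal.T)   # shape (299, 5)
band = by_band[0]                          # contiguous view of one band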
Here are some resources I found: (not an endorsement)
StackOverflow article on contiguous memory: What is the difference between contiguous and non-contiguous arrays?
Towards Data Science article on vectorized operations https://towardsdatascience.com/vectorization-must-know-technique-to-speed-up-operations-100x-faster-50b6e89ddd45
I'm trying to parallelize different tensor operations. I'm aware that tf.vectorized_map and/or tf.map_fn can parallelize input tensor(s) with respect to their first axis, but that's not what I'm looking for. I'm looking for a way to parallelize a for loop on a set of tensors with possibly different shapes.
import tensorflow as tf

a = tf.ones((2,))
b = tf.ones((2, 2))
list_of_tensors = [a, b * 2, a * 3]
for t in list_of_tensors:
    # some operation on t, which may vary depending on its shape
    ...
Is there a possible way to parallelize this for loop on GPU with TensorFlow? (I'm open to any other library if possible i.e. JAX, numba etc.)
Thanks!
According to the documentation, "The shape and dtype of any intermediate or output tensors in the computation of fn should not depend on the input to fn."
I'm struggling with this problem myself. I think the answer is the one suggested in the comments: if you know the maximum length your tensor can have, represent the variable-length tensor as a maximum-length tensor plus an integer giving its actual length. Whether this helps at all depends on the meaning of "any intermediate", because at some point you may still need the result for the actual, shorter tensor in your computation; it's a bit of a tail-chasing exercise. This part of TensorFlow is extremely frustrating: it's very hacky to get things working that should be easy, especially when it comes to true parallelism on the GPU for deterministic matrix algorithms, outside the context of machine learning.
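A minimal sketch of that padding idea for same-rank tensors (the tensors and the masked sum are just placeholders; it does not help when the tensors also differ in rank, as with the question's a and b):

import tensorflow as tf

# Hypothetical batch of 1-D tensors with different lengths.
tensors = [tf.ones(2), tf.ones(3) * 2.0, tf.ones(5) * 3.0]

max_len = max(int(t.shape[0]) for t in tensors)
lengths = tf.constant([int(t.shape[0]) for t in tensors])

# Pad every tensor to the maximum length so they stack into one batch...
padded = tf.stack([tf.pad(t, [[0, max_len - int(t.shape[0])]]) for t in tensors])
# ...and keep a mask that remembers which entries are real.
mask = tf.sequence_mask(lengths, maxlen=max_len, dtype=padded.dtype)

# Example batched operation: a masked sum per tensor, computed in one call.
sums = tf.reduce_sum(padded * mask, axis=1)   # [2., 6., 15.]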
This might work inside the loop:
tf.autograph.experimental.set_loop_options(
    shape_invariants=[(v, tf.TensorShape([None]))]
)
Dear stackoverflow community!
Today I found that on a high-end cluster architecture, an elementwise multiplication of 2 cubes with dimensions 1921 x 512 x 512 takes ~ 27 s. This is much too long since I have to perform such computations at least 256 times for an azimuthal averaging of a power spectrum in the current implementation. I found that the slow performance was mainly due to different stride structures (C in one case and FORTRAN in the other). One of the two arrays was a newly generated boolean grid (C order) and the other one (FORTRAN order) came from the 3D numpy.fft.fftn() Fourier transform of an input grid (C order). Any reasons why numpy.fft.fftn() changes the strides and ideas on how to prevent that except for reversing the axes (which would be just a workaround)? With similar strides (ndarray.copy() of the FT grid), ~ 4s are achievable, a tremendous improvement.
The question is therefore as follows:
Consider the array:
ran = np.random.rand(1921, 512, 512)
ran.strides
(2097152, 4096, 8)
a = np.fft.fftn(ran)
a.strides
(16, 30736, 15736832)
As we can see the stride structure is different. How can this be prevented (without using a = np.fft.fftn(ran, axes = (1,0)))? Are there any other numpy array routines that could affect stride structure? What can be done in those cases?
Helpful advice is as usual much appreciated!
You could use scipy.fftpack.fftn (as suggested by hpaulj too) rather than numpy.fft.fftn; it looks like it does what you want. It is, however, slightly slower:
import numpy as np
import scipy.fftpack
ran = np.random.rand(192, 51, 51) # not much memory on my laptop
a = np.fft.fftn(ran)
b = scipy.fftpack.fftn(ran)
ran.strides
(20808, 408, 8)
a.strides
(16, 3072, 156672)
b.strides
(41616, 816, 16)
timeit -n 100 np.fft.fftn(ran)
100 loops, best of 3: 37.3 ms per loop
timeit -n 100 scipy.fftpack.fftn(ran)
100 loops, best of 3: 41.3 ms per loop
Any reasons why numpy.fft.fftn() changes the strides and ideas on how to prevent that except for reversing the axes (which would be just a workaround)?
Computing the multidimensional DFT of an array consists of successively computing 1D DFTs over each dimension. There are two strategies:
Restrict 1D DFT computations to contiguous 1D arrays. Since each array is contiguous, problems related to latency/cache misses are reduced. This strategy has a major drawback: the array has to be transposed between each dimension. It is likely the strategy adopted by numpy.fft. At the end of the computation the array has been transposed; to avoid unnecessary work, the transposed array is returned and the strides are modified.
Enable 1D DFT computations on strided arrays. This may trigger some latency-related problems. It is the strategy of FFTW, available through the pyfftw interface. As a result, the output array features the same strides as the input array.
Timing numpy.fft.fftn and pyfftw.interfaces.numpy_fft.fftn, as performed here and there or there, will tell you whether FFTW really is the Fastest Fourier Transform in the West or not...
To check that numpy uses the first strategy, take a look at numpy/fft/fftpack.py. At line 81-85, the call to work_function(a, wsave) (i.e. fftpack.cfftf, from FFTPACK, arguments documented there) is enclosed between calls to numpy.swapaxes() performing the transpositions.
scipy.fftpack.fftn does not seem to change the strides... Nevertheless, it seems that it uses the first strategy as well: scipy.fftpack.fftn() calls scipy.fftpack.zfftnd(), which calls zfft(), based on zfftf1, which does not seem to handle strided DFTs. Moreover, zfftnd() calls the function flatten() many times, which performs the transposition.
Another example: for parallel distributed-memory multidimensional DFTs, FFTW-MPI uses the first strategy to avoid any MPI communication between processes during the 1D DFTs. Of course, functions to transpose the array are not far away, and a lot of MPI communication is involved in the process.
Are there any other numpy array routines that could affect stride structure? What can be done in those cases?
You can search the github repository of numpy for swapaxes: this function is only used a couple of times. Hence, to my mind, this "change of strides" is particular to fft.fftn(), and most numpy functions keep the strides unchanged.
Finally, the "change of strides" is a feature of the first strategy and there is no way to prevent that. The only workaround is to swap the axes at the end of the computation, which is costly. But you can rely on pyfftw since fftw implements the second strategy in a very efficient way. The DFT computations will be faster, and subsequent computations will be faster as well if the strides of the different arrays become consistent.
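A hedged sketch of that comparison (pyfftw is assumed to be installed, the grid is shrunk, and the exact stride values depend on your numpy/pyfftw versions):

import numpy as np
import pyfftw.interfaces.numpy_fft as pfft

ran = np.random.rand(192, 64, 64)   # small stand-in for the 1921 x 512 x 512 cube

a_np = np.fft.fftn(ran)   # may come back with reversed (Fortran-like) strides
a_fw = pfft.fftn(ran)     # FFTW transforms the strided axes in place, keeping C order
print(a_np.strides, a_fw.strides)

# If you stay with numpy.fft, one explicit copy restores C order, as in the question:
a_c = np.ascontiguousarray(a_np)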
I'm using scipy to do some image processing work, and I found something quite confusing: some functions, say scipy.signal.convolve and scipy.ndimage.filters.convolve, have the same name and functionality but belong to different modules of scipy, so I kind of wonder why they aren't just implemented once?
They do slightly different things, mostly related with how they handle the convolution when the two arrays being convolved don't fully overlap.
scipy.ndimage.filters.convolve always returns an array of the same size as its first argument. To handle areas near the boundaries, where the second array does not fully overlap with the first, it fills in the missing values using one of these options: reflect, constant, nearest, mirror or wrap.
scipy.signal.convolve always pads the arrays with zeros as needed and offers three modes, full, valid or same, which determine the size of the returned array depending on whether values computed from the zero-padding are kept or discarded.
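A small illustration of the two behaviours (1-D for brevity; scipy.ndimage.convolve is the current name for scipy.ndimage.filters.convolve):

import numpy as np
from scipy import signal, ndimage

a = np.arange(5, dtype=float)   # [0, 1, 2, 3, 4]
k = np.ones(3) / 3.0            # simple averaging kernel

# signal.convolve zero-pads; the mode picks how much of the result to keep.
print(signal.convolve(a, k, mode='full'))    # length 7
print(signal.convolve(a, k, mode='same'))    # length 5
print(signal.convolve(a, k, mode='valid'))   # length 3

# ndimage.convolve always returns length 5; the mode picks the boundary rule.
print(ndimage.convolve(a, k, mode='nearest'))
print(ndimage.convolve(a, k, mode='constant', cval=0.0))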
There are times when you have to perform many intermediate operations on one, or more, large Numpy arrays. This can quickly result in MemoryErrors. In my research so far, I have found that Pickling (Pickle, CPickle, Pytables etc.) and gc.collect() are ways to mitigate this. I was wondering if there are any other techniques experienced programmers use when dealing with large quantities of data (other than removing redundancies in your strategy/code, of course).
Also, if there's one thing I'm sure of, it's that nothing is free. With some of these techniques, what are the trade-offs (i.e., speed, robustness, etc.)?
I feel your pain... You sometimes end up storing several times the size of your array in values you will later discard. When processing one item in your array at a time, this is irrelevant, but can kill you when vectorizing.
I'll use an example from work for illustration purposes. I recently coded the algorithm described here using numpy. It is a color map algorithm, which takes an RGB image, and converts it into a CMYK image. The process, which is repeated for every pixel, is as follows:
Use the most significant 4 bits of every RGB value, as indices into a three-dimensional look up table. This determines the CMYK values for the 8 vertices of a cube within the LUT.
Use the least significant 4 bits of every RGB value to interpolate within that cube, based on the vertex values from the previous step. The most efficient way of doing this requires computing 16 arrays of uint8s the size of the image being processed. For a 24-bit RGB image, that is equivalent to needing roughly six times the storage of the image to process it.
A couple of things you can do to handle this:
1. Divide and conquer
Maybe you cannot process a 1,000x1,000 array in a single pass. But if you can do it with a python for loop iterating over 10 arrays of 100x1,000, it will still beat a python iterator over 1,000,000 items by a very wide margin! It's going to be slower than a single vectorized pass, yes, but not by nearly as much.
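A toy sketch of that idea (process is a hypothetical stand-in for your per-element work):

import numpy as np

def process(block):
    # placeholder for whatever expensive elementwise work you actually do
    return np.sqrt(block) * 2.0

big = np.random.rand(1000, 1000)
out = np.empty_like(big)

# Ten passes over 100 x 1000 blocks instead of one pass over the full array,
# so the temporaries created inside process() stay ten times smaller.
for start in range(0, big.shape[0], 100):
    out[start:start + 100] = process(big[start:start + 100])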
2. Cache expensive computations
This relates directly to my interpolation example above, and is harder to come across, although it's worth keeping an eye open for. Because I am interpolating on a three-dimensional cube with 4 bits in each dimension, there are only 16x16x16 possible outcomes, which can be stored in 16 arrays of 16x16x16 bytes. So I can precompute them and store them in 64KB of memory, then look the values up one by one for the whole image, rather than redoing the same operations for every pixel at a huge memory cost. This already pays off for images as small as 64x64 pixels, and basically allows processing images with six times as many pixels without having to subdivide the array.
3. Use your dtypes wisely
If your intermediate values can fit in a single uint8, don't use an array of int32s! This can turn into a nightmare of mysterious errors due to silent overflows, but if you are careful, it can provide a big saving of resources.
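A tiny illustration of the trade-off (values are arbitrary):

import numpy as np

img = np.random.randint(0, 16, size=(1024, 1024), dtype=np.uint8)

doubled = img * 2               # stays uint8: 1 MB instead of 4 MB for int32
print(doubled.dtype)            # uint8

# ...but overflows are silent (or a mere warning): 200 * 2 wraps around to 144.
print(np.uint8(200) * np.uint8(2))      # 144, not 400
wide = img.astype(np.uint16) * 200      # widen first when values can exceed 255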
First, most important trick: allocate a few big arrays and use and recycle portions of them, instead of creating and then discarding/garbage collecting lots of temporary arrays. It sounds a little old-fashioned, but with careful programming the speed-up can be impressive. (You have better control over alignment and data locality, so numeric code can be made more efficient.)
Second: use numpy.memmap and hope that the OS caching of disk accesses is efficient enough. (A brief sketch of these first two tricks follows after these notes.)
Third: as pointed out by @Jaime, work on block sub-matrices if the whole matrix is too big.
EDIT:
Avoid unnecessary list comprehensions, as pointed out in this answer on SE.
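A rough sketch of the first two tricks (array sizes and the scratch file name are placeholders):

import numpy as np

# Trick 1: allocate work buffers once and reuse them via out=, instead of
# letting every expression create and discard fresh temporaries.
a = np.random.rand(2048, 2048)
b = np.random.rand(2048, 2048)
buf = np.empty_like(a)

for _ in range(100):
    np.multiply(a, b, out=buf)   # writes into the recycled buffer
    np.add(buf, a, out=buf)      # keeps reusing the same memory

# Trick 2: back a huge array with a file on disk; only touched pages use RAM.
big = np.memmap('scratch.dat', dtype='float64', mode='w+', shape=(20000, 20000))
big[:2048, :2048] = buf
big.flush()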
The dask.array library provides a numpy interface that uses blocked algorithms to handle larger-than-memory arrays with multiple cores.
You could also look into Spartan, Distarray, and Biggus.
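A hypothetical sketch with dask.array (the shape and chunk size are placeholders):

import dask.array as da

# 20000 x 20000 doubles split into 1000 x 1000 blocks; never fully in memory.
x = da.random.random((20000, 20000), chunks=(1000, 1000))

# Expressions build a graph of per-block numpy operations; nothing runs and
# nothing big is materialised until .compute() is called.
y = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-12)
print(y[:3, :3].compute())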
If it is possible for you, use numexpr. For numeric calculations like a**2 + b**2 + 2*a*b (for a and b being arrays) it
will compile machine code that will execute fast and with minimal memory overhead, taking care of memory locality stuff (and thus cache optimization) if the same array occurs several times in your expression,
uses all cores of your dual or quad core CPU,
is an extension to numpy, not an alternative.
For medium and large sized arrays, it is faster than numpy alone.
Take a look at the web page given above, there are examples that will help you understand if numexpr is for you.
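A minimal example (the array size is chosen arbitrarily):

import numpy as np
import numexpr as ne

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)

# One fused, multi-threaded pass over the data, without the temporaries that
# plain numpy would create for a**2, b**2 and 2*a*b.
c = ne.evaluate('a**2 + b**2 + 2*a*b')

print(np.allclose(c, (a + b)**2))   # True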
On top of everything said in the other answers: numpy's ufuncs can aggregate an array without materialising the intermediate results, and when we do want to keep all the intermediate results of the computation, we can use accumulate:
Aggregates
For binary ufuncs, there are some interesting aggregates that can be computed directly from the object. For example, if we'd like to reduce an array with a particular operation, we can use the reduce method of any ufunc. A reduce repeatedly applies a given operation to the elements of an array until only a single result remains.
For example, calling reduce on the add ufunc returns the sum of all elements in the array:
x = np.arange(1, 6)
np.add.reduce(x) # Outputs 15
Similarly, calling reduce on the multiply ufunc results in the product of all array elements:
np.multiply.reduce(x) # Outputs 120
Accumulate
If we'd like to store all the intermediate results of the computation, we can instead use accumulate:
np.add.accumulate(x) # Outputs array([ 1, 3, 6, 10, 15], dtype=int32)
np.multiply.accumulate(x) # Outputs array([ 1, 2, 6, 24, 120], dtype=int32)
Using these numpy operations wisely while performing many intermediate operations on one or more large Numpy arrays can give you great results without the use of any additional libraries.