I'm toying with a problem modelled by a linear system, which can be written as a square block-tridiagonal matrix. These blocks are of size b = 4n+8, and the full matrix is of size Nb; N could be arbitrarily large (reasonably, of course) while n is kept rather small (typically less than 10).
The blocks themselves are sparse, the first diagonal being only identity matrices, and the second diagonals having only n+1 non-zero columns (so 3n+7 columns of zeroes) per block. These columns are contiguous, either zeroes then non-zeroes or the other way around.
Building all these blocks in memory results in a (3N-2) x b x b array that can be turned into a sparse matrix with scipy.sparse.bsr_matrix, then cast to CSR format and trimmed of the excess zeroes. It works nicely, but I'd rather skip this large, mostly-zero temporary array altogether (for N = 1e4, n = 5, it holds 5.6 zeroes for every relevant entry!).
I had a look at scipy.sparse.dok_matrix, which is recommended for slicing and incremental building. Creating my entries fits in a tidy loop, but the process is ~10 times slower than using bsr_matrix with my unnecessary dense array, which will be detrimental to future use cases.
It doesn't seem like bsr_matrix can be used directly with scipy sparse matrices as input.
Using bsr_matrix without including the diagonal blocks, then adding a sparse eye greatly reduces the number of zeros (3.5 per relevant entry in my test configuration) and speeds up the process by a third compared to the original solution. Score!
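A minimal sketch of this approach, for the record. The block arrays lower and upper (filled with random numbers here) are placeholders for the real sub- and super-diagonal blocks of the model, and the sizes are just my test values:

    import numpy as np
    import scipy.sparse as sp

    N, n = 1000, 5
    b = 4 * n + 8

    # Placeholder off-diagonal blocks; in practice these come from the model.
    lower = np.random.rand(N - 1, b, b)   # sub-diagonal blocks A[i+1, i]
    upper = np.random.rand(N - 1, b, b)   # super-diagonal blocks A[i, i+1]

    # BSR structure for the two off-diagonals only (no diagonal blocks stored).
    data, indices, indptr = [], [], [0]
    for i in range(N):
        if i > 0:
            data.append(lower[i - 1])     # block column i-1
            indices.append(i - 1)
        if i < N - 1:
            data.append(upper[i])         # block column i+1
            indices.append(i + 1)
        indptr.append(len(indices))

    off_diag = sp.bsr_matrix((np.array(data), indices, indptr), shape=(N * b, N * b))

    # Identity main diagonal added as a sparse eye, then cast to CSR and trim.
    A = (off_diag + sp.eye(N * b, format='csr')).tocsr()
    A.eliminate_zeros()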
Any clue on things that I could do to further reduce the memory footprint of building this matrix? The obvious goal being to give me more freedom with the choice of N.
EDIT
I managed to improve things a tad more by constructing the three block-diagonals separately. Doing so, I need less padding for my 2nd diagonals (n+3 columns instead of 3n+7; 1.3 zeroes per relevant entry), since each original block splits into two vertical blocks (one of which is all zeroes), and I only need the temporary array for one diagonal at a time, cutting the memory cost in half on top of that. The main diagonal is still built with the eye method. The icing on the cake: a speed-up of 25% compared to my 3rd bullet point, probably because separating the two 2nd diagonals saves some array reshaping operations needed before using bsr_matrix. Compared to the original method, for my (N, n) = (1e4, 5) test case, that's ~20M zeroes saved when comparing matrices before trimming. At 128 bits each, it's a decent gain already!
The only possible improvement that I can picture now is building these diagonals separately, without any padding, then inserting columns of zeros (probably via products with block-matrices of identities) and finally adding everything together.
I also read something about using a dict to update an empty dok_matrix, but in my case I think I would need to expand lists of indices and take their Cartesian product to construct the keys, and each element of my blocks would need to be an individual value, as one apparently cannot use slices as dictionary keys.
I ended up implementing the solution I proposed in my last paragraph.
For each 2nd diagonal, I construct a block sparse matrix without any padding, then I transform it into a matrix of the proper shape by a right-hand-side product with a block matrix whose blocks are identities. I do need to store zeroes here to use bsr_matrix (I first gave scipy.sparse.block_diag a try, but it was extremely slow), but fewer of them compared to my semi-padding solution: (4n+7)(n+1) vs (4n+8)(n+3), and they can be represented with 8 bits instead of 128. Execution time is increased by ~40%, but I can live with that (and it's still a 20% decrease compared to the first solution).
I might be missing something here, but for now I'm pretty satisfied with this solution.
EDIT
Trimming the zeroes of the RHS matrices before performing the product reduces the execution time by a further 30% compared to the previously most efficient solution; all's well that ends well.
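For completeness, here is a rough sketch of what the final construction looks like for one of the 2nd diagonals. Everything here is a placeholder for the real model: chunks stands for the n+1 non-zero columns of each super-diagonal block, offset for where those columns sit inside a full b-wide block; the sub-diagonal is handled the same way with its own offset and shifted block indices, and the main diagonal is still a sparse eye.

    import numpy as np
    import scipy.sparse as sp

    N, n = 1000, 5
    b = 4 * n + 8

    # Placeholder data: the n+1 non-zero columns of each super-diagonal block.
    chunks = np.random.rand(N - 1, b, n + 1)
    offset = 0                                   # column offset of those columns within a block

    # Unpadded super-diagonal: block size (b, n+1), one block per block row 0..N-2.
    indices = np.arange(1, N)                    # block column of each stored block
    indptr = np.append(np.arange(N), N - 1)      # last block row is empty
    S = sp.bsr_matrix((chunks, indices, indptr), shape=(N * b, N * (n + 1)))

    # RHS expansion matrix: block-diagonal, each (n+1) x b block is an identity
    # placed at the column offset; the padding zeroes are stored as uint8.
    eblock = np.zeros((n + 1, b), dtype=np.uint8)
    eblock[:, offset:offset + n + 1] = np.eye(n + 1, dtype=np.uint8)
    E = sp.bsr_matrix((np.repeat(eblock[np.newaxis], N, axis=0),
                       np.arange(N), np.arange(N + 1)),
                      shape=(N * (n + 1), N * b))

    # Trimming the padding zeroes before the product is what gives the last speed-up.
    E = E.tocsr()
    E.eliminate_zeros()

    upper_diag = (S @ E).tocsr()                 # full-width super-diagonal, no dense padding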
Related
I have a tensor of example shape (543, 133, 3), meaning 543 frames, each with 133 points of X, Y, Z coordinates.
I would like to run a savgol_filter on every point in every dimension, however, naively, this is quite slow:
    import numpy as np
    import scipy.signal

    points, frames, dims = tensor.shape
    new_data = []
    for point in range(points):
        new_dims = []
        for dim in range(dims):
            new_dims.append(scipy.signal.savgol_filter(tensor[point, :, dim], 3, 1))
        new_data.append(new_dims)
    tensor = np.array(new_data)
On my computer, for this small tensor, this takes 300ms, which is quite a long time.
Is there a way to make this faster?
This is by no means the fastest method, but it should be quite a lot faster than what you're currently doing. We can utilize vectorized operations instead of for loops to achieve much better performance.
From your code, it seems like you want to smooth along the 133 dimension (axis 1), so you can apply the Savitzky-Golay filter all at once with savgol_filter(tensor, 3, 1, axis=1). In general, you can specify the axis along which you'd like to apply the filter. On my computer, this brought the computation from 500ms down to 2ms.
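In full, with a random array standing in for your real tensor:

    import numpy as np
    from scipy.signal import savgol_filter

    tensor = np.random.rand(543, 133, 3)   # stand-in for the real data

    # One call filters every frame and coordinate at once along axis 1.
    smoothed = savgol_filter(tensor, window_length=3, polyorder=1, axis=1)

Note that, unlike the nested loops above, this keeps the original (543, 133, 3) layout instead of swapping the last two axes.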
A side note: Since you care about performance, I would pay attention to what your data order is. Depending on what you're doing, it might be advisable to reorder your data once to save time.
For example: let's say you have a matrix of 5 signals (5x299). If you want a single signal, that's easy: signal[0] is a contiguous "view" of the memory, no copying required. But what if you want a particular band across the signals? signal[:, 0] picks one element out of every row, so the data is scattered across memory and every operation on it has to stride through the whole matrix. If you had first transposed the matrix (and made it contiguous), then the first index would give you the band of every spectrum directly, as one contiguous run, with no need for iteration. Data order can be an important part of getting the best performance out of your computations.
There are two related concepts here: contiguous memory and vectorized operations. My explanation of why data order matters glosses over some complications, and you will need to do your own research to determine which data ordering gives the best performance for your application. The big thing to watch out for is C vs. Fortran contiguous memory layout.
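A toy illustration of the point about data order (the array here is just random numbers standing in for 5 signals of 299 samples):

    import numpy as np

    signals = np.random.rand(5, 299)             # 5 signals stored row-major (C order)

    print(signals[0].flags['C_CONTIGUOUS'])      # True: one signal is one contiguous run of memory
    print(signals[:, 0].flags['C_CONTIGUOUS'])   # False: one band is scattered across the rows

    # If band access dominates, pay the cost of reordering once:
    bands = np.ascontiguousarray(signals.T)      # shape (299, 5), each band now contiguous
    print(bands[0].flags['C_CONTIGUOUS'])        # True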
Here are some resources I found: (not an endorsement)
Stack Overflow question on contiguous memory: What is the difference between contiguous and non-contiguous arrays?
Towards Data Science article on vectorized operations https://towardsdatascience.com/vectorization-must-know-technique-to-speed-up-operations-100x-faster-50b6e89ddd45
I have 400 2x2 numpy matrices, and I need to sum them all together. I was wondering if there's any better way to do it than using a for loop, as iterating consumes a lot of time and memory, particularly if I end up with more matrices (which might be the case in the future).
Just figured it out. All my matrices were in a list, so I used
np.sum(<list>, axis=0)
And it gives me the resultant 2x2 matrix of the sum of all the 400 matrices!
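A self-contained version of the same idea, with random matrices standing in for the real ones:

    import numpy as np

    matrices = [np.random.rand(2, 2) for _ in range(400)]   # stand-in for the real matrices

    total = np.sum(matrices, axis=0)      # stacks the list into a (400, 2, 2) array and sums axis 0

    # Equivalent if the matrices are already stacked in a single array:
    stacked = np.stack(matrices)          # shape (400, 2, 2)
    total2 = stacked.sum(axis=0)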
I'd like to read 2048 randomly chosen rows of a stored numpy matrix of column size 200 within 100ms. So far I've tried with h5py. In my case, contiguous mode works faster than chunks, and for various other reasons I'm trying with the former. Writing (in a certain more orderly way) is very fast (~3ms); unfortunately, reading 2048 randomly chosen rows takes about 250ms. The reading part I'm trying is as follows:
    a = f['/test']
    x = []
    for i in range(2048):
        r = random.randint(0, 2047)   # randint is inclusive on both ends; valid row indices are 0..2047
        x.append(a[[r], ...])
    x = np.concatenate(x, 0)
Clearly, the speed bottleneck is accessing 'a' 2048 times, because I don't know whether there exists a one-shot way of accessing random rows or not. np.concatenate consumes a negligible amount of time. Since the matrix eventually reaches a size of (2048*100k, 200), I probably can't use a method other than contiguous h5py. I've tried with a smaller maximal matrix size, but it didn't affect the computation time at all. For reference, the following is the entire task I'm trying to achieve as part of a deep reinforcement learning algorithm:
Generate a numpy array of size (2048, 200)
Write it onto the next available 2048 rows in an extendable list (None, 200)
Randomly pick 2048 rows from the filled rows of the extendable list (irrespective of the generated chunk in the step 1)
Read the picked rows
Repeat steps 1-4 100k times (so the total list size becomes (2048*100k, 200))
If rows can be selected more than once, I would try with:
random.choices(a, k=2048)
Otherwise, use:
random.sample(a, 2048)
Both methods will return a list of numpy arrays if a is a numpy ndarray.
Furthermore, if a is already a numpy array, why not take advantage of numpy's indexing capabilities and shorten your code to:
x.append(a[np.random.randint(0, 2048, 2048)])
That way a is still accessed multiple times, but it is all done in optimized C code, which should be faster.
Hope this points you in the right direction.
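Since a in the question is actually an h5py dataset, one way to turn the 2048 reads into a single read is fancy indexing with sorted, de-duplicated row indices, restoring the requested order afterwards. This is only a sketch under those assumptions (the file name, dataset path, and number of filled rows are placeholders), and whether it beats the loop depends on the HDF5 layout:

    import numpy as np
    import h5py

    with h5py.File('data.h5', 'r') as f:          # placeholder file name
        dset = f['/test']
        filled = 2048                             # placeholder: number of rows written so far

        picks = np.random.randint(0, filled, size=2048)        # sample with replacement
        uniq, inverse = np.unique(picks, return_inverse=True)  # h5py wants sorted, unique indices
        block = dset[uniq, :]                     # one HDF5 read instead of 2048
        x = block[inverse]                        # restore the requested order and duplicates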
There are times when you have to perform many intermediate operations on one, or more, large Numpy arrays. This can quickly result in MemoryErrors. In my research so far, I have found that pickling data to disk (pickle, cPickle, PyTables, etc.) and gc.collect() are ways to mitigate this. I was wondering if there are any other techniques experienced programmers use when dealing with large quantities of data (other than removing redundancies in your strategy/code, of course).
Also, if there's one thing I'm sure of, it's that nothing is free. With some of these techniques, what are the trade-offs (i.e., speed, robustness, etc.)?
I feel your pain... You sometimes end up storing several times the size of your array in values you will later discard. When processing one item in your array at a time, this is irrelevant, but can kill you when vectorizing.
I'll use an example from work for illustration purposes. I recently coded the algorithm described here using numpy. It is a color map algorithm, which takes an RGB image, and converts it into a CMYK image. The process, which is repeated for every pixel, is as follows:
Use the most significant 4 bits of every RGB value, as indices into a three-dimensional look up table. This determines the CMYK values for the 8 vertices of a cube within the LUT.
Use the least significant 4 bits of every RGB value to interpolate within that cube, based on the vertex values from the previous step. The most efficient way of doing this requires computing 16 arrays of uint8s the size of the image being processed. For a 24-bit RGB image, that is equivalent to needing 6x the storage of the image just to process it.
A couple of things you can do to handle this:
1. Divide and conquer
Maybe you cannot process a 1,000x1,000 array in a single pass. But if you can do it with a Python for loop over 10 arrays of 100x1,000, it will still beat a Python iterator over 1,000,000 items by a very wide margin! It's going to be slower than a single vectorized pass, yes, but not by nearly as much.
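A minimal sketch of that pattern, with a made-up process function standing in for the real per-chunk work:

    import numpy as np

    def process(chunk):
        return np.sqrt(chunk) + 1.0               # stand-in for the real vectorized computation

    data = np.random.rand(1000, 1000)
    out = np.empty_like(data)

    chunk_rows = 100                              # tune so the temporaries fit in memory
    for start in range(0, data.shape[0], chunk_rows):
        stop = start + chunk_rows
        out[start:stop] = process(data[start:stop])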
2. Cache expensive computations
This relates directly to my interpolation example above, and is harder to come across, although it's worth keeping an eye open for it. Because I am interpolating on a three-dimensional cube with 4 bits in each dimension, there are only 16x16x16 possible outcomes, which can be stored in 16 arrays of 16x16x16 bytes. So I can precompute them, store them using 64KB of memory, and look up the values one by one for the whole image, rather than redoing the same operations for every pixel at a huge memory cost. This already pays off for images as small as 64x64 pixels, and basically allows processing images with 6x the number of pixels without having to subdivide the array.
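A toy version of the same idea, reduced to one dimension (the expensive function and the image are made up; the point is that a 4-bit input only has 16 possible values, so each answer is computed once and gathered):

    import numpy as np

    def expensive(v):                             # stand-in for the real per-value work
        return np.sin(v / 15.0 * np.pi) * 255.0

    lut = expensive(np.arange(16))                # 16 precomputed outcomes

    image = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
    high_nibble = image >> 4                      # most significant 4 bits of each pixel
    result = lut[high_nibble]                     # cheap table lookup instead of recomputation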
3. Use your dtypes wisely
If your intermediate values can fit in a single uint8, don't use an array of int32s! This can turn into a nightmare of mysterious errors due to silent overflows, but if you are careful, it can provide a big saving of resources.
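For example (the sizes here are arbitrary):

    import numpy as np

    mask = np.random.randint(0, 256, size=(4096, 4096))   # default integer dtype, 8 bytes per entry on most platforms
    small = mask.astype(np.uint8)                          # same values, 1/8 of the memory
    print(mask.nbytes // 2**20, small.nbytes // 2**20)     # 128 MiB vs 16 MiB with a 64-bit default int

    # The price: small integer types wrap around silently on overflow.
    x = np.array([200], dtype=np.uint8)
    print(x + 100)                                         # [44], not [300]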
First, most important trick: allocate a few big arrays, and use and recycle portions of them, instead of bringing lots of temporary arrays into life and then discarding/garbage-collecting them. It sounds a little bit old-fashioned, but with careful programming the speed-up can be impressive. (You have better control of alignment and data locality, so numeric code can be made more efficient.)
Second: use numpy.memmap and hope that the OS caching of disk accesses is efficient enough.
Third: as pointed out by @Jaime, work on block sub-matrices if the whole matrix is too big.
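A minimal sketch of the recycling idea, using the out= argument that numpy ufuncs accept so that the same preallocated buffers are reused instead of fresh temporaries being created on every step:

    import numpy as np

    a = np.random.rand(4096, 4096)
    b = np.random.rand(4096, 4096)

    # Allocate the work buffers once...
    tmp = np.empty_like(a)
    result = np.empty_like(a)

    # ...and reuse them on every pass instead of building new temporaries.
    np.multiply(a, b, out=tmp)        # tmp = a * b
    np.add(tmp, a, out=result)        # result = a * b + a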
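A sketch of what that looks like, assuming a raw float64 dump on disk with a known shape (the file name and shape are placeholders):

    import numpy as np

    # Nothing is read until a slice is touched; the OS pages data in on demand.
    big = np.memmap('big.dat', dtype=np.float64, mode='r', shape=(80000, 60000))
    chunk = np.asarray(big[1000:1100])     # materialize just 100 rows in RAM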
EDIT:
Avoid unnecessary list comprehensions, as pointed out in this answer on SE.
The dask.array library provides a numpy interface that uses blocked algorithms to handle larger-than-memory arrays with multiple cores.
You could also look into Spartan, Distarray, and Biggus.
If it is possible for you, use numexpr. For numeric calculations like a**2 + b**2 + 2*a*b (with a and b being arrays), it:
- compiles the expression to machine code that executes fast and with minimal memory overhead, taking care of memory locality (and thus cache optimization) if the same array occurs several times in your expression,
- uses all the cores of your dual- or quad-core CPU,
- is an extension to numpy, not an alternative.
For medium and large sized arrays, it is faster than numpy alone.
Take a look at the web page linked above; there are examples that will help you understand whether numexpr is for you.
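The usage is essentially a one-liner; for the expression above it would look like this (array sizes are arbitrary):

    import numpy as np
    import numexpr as ne

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)

    # One fused, multi-threaded pass over the data; no full-size temporaries
    # are created for a**2, b**2 or 2*a*b.
    c = ne.evaluate("a**2 + b**2 + 2*a*b")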
On top of everything said in the other answers: when we only need the result of an aggregation, or its running intermediate values, numpy's ufunc methods reduce and accumulate can compute them directly, without building extra temporary arrays:
Aggregates
For binary ufuncs, there are some interesting aggregates that can be computed directly from the object. For example, if we'd like to reduce an array with a particular operation, we can use the reduce method of any ufunc. A reduce repeatedly applies a given operation to the elements of an array until only a single result remains.
For example, calling reduce on the add ufunc returns the sum of all elements in the array:
x = np.arange(1, 6)
np.add.reduce(x) # Outputs 15
Similarly, calling reduce on the multiply ufunc results in the product of all array elements:
np.multiply.reduce(x) # Outputs 120
Accumulate
If we'd like to store all the intermediate results of the computation, we can instead use accumulate:
np.add.accumulate(x) # Outputs array([ 1, 3, 6, 10, 15], dtype=int32)
np.multiply.accumulate(x) # Outputs array([ 1, 2, 6, 24, 120], dtype=int32)
Using these numpy operations wisely while performing many intermediate operations on one or more large numpy arrays can give you great results without any additional libraries.
I have a large matrix (approx. 80,000 X 60,000), and I basically want to scramble all the entries (that is, randomly permute both rows and columns independently).
I believe it'll work if I loop over the columns, and use randperm to randomly permute each column. (Or, I could equally well do rows.) Since this involves a loop with 60K iterations, I'm wondering if anyone can suggest a more efficient option?
I've also been working with numpy/scipy, so if you know of a good option in python, that would be great as well.
Thanks!
Susan
Thanks for all the thoughtful answers! Some more info: the rows of the matrix represent documents, and the data in each row is a vector of tf-idf weights for that document. Each column corresponds to one term in the vocabulary. I'm using pdist to calculate cosine similarities between all pairs of papers. And I want to generate a random set of papers to compare to.
I think that just permuting the columns will work, then, because each paper gets assigned a random set of term frequencies. (Permuting the rows just means reordering the papers.) As Jonathan pointed out, this has the advantage of not making a new copy of the whole matrix, and it sounds like the other options all will.
You should be able to reshape the matrix to a 1 × 4800000000 "array", randperm it, and finally reshape it back to a 80000 × 60000 matrix.
This will require copying the 4.8 billion entries 3 times at worst. This might not be efficient.
EDIT: Actually Matlab automatically uses linear indexing, so the first reshape is not needed. Just
reshape(x(randperm(4800000000)), 80000, 60000)
is enough (thus saving one potentially unnecessary copy).
Note that this assumes you have a dense matrix. If you have a sparse matrix, you could extract the values and then randomly reassign indices to them. If there are N nonzero entries, then only 8N copies are needed at worst (3 numbers are required to describe one entry).
I think it would be better to do this:
import numpy as np
flat = matrix.ravel()
np.random.shuffle(flat)
This basically flattens the matrix into a 1-D view and shuffles it in place, so the original matrix ends up scrambled without a full extra copy (note that ravel returns a copy instead of a view if the data isn't contiguous, in which case the original would be left untouched).
Both solutions above are great and will work, but I believe both will involve making a completely new copy of the entire matrix in memory while doing the work. Since this is a huge matrix, that's pretty painful. In the case of the MATLAB solution, I think you may even be creating two extra temporary copies, depending on how reshape works internally.

I think you were on the right track by operating on columns, but the problem is that this only scrambles along columns. However, I believe that if you then do randperm along the rows as well, you'll end up with a fully scrambled matrix, and this way you'll only be creating temporary variables that are, at worst, 80,000 by 1. Yes, that's two loops with 60,000 and 80,000 iterations each, but internally that work has to happen regardless: the algorithm is going to have to visit each memory location at least twice. You could probably write a more efficient algorithm as a C MEX function that operates completely in place, but I assume you'd rather not do that.
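Since the question also asks about a Python option, here is a rough numpy sketch of the same column-then-row scrambling, with a scaled-down random matrix standing in for the real 80,000 x 60,000 one; the only temporaries are single columns or rows, never a full copy:

    import numpy as np

    rng = np.random.default_rng()
    m = np.random.rand(8000, 6000)        # scaled-down stand-in for the real matrix

    # Independently permute the entries within each column...
    for j in range(m.shape[1]):
        m[:, j] = m[rng.permutation(m.shape[0]), j]

    # ...then within each row (rows are contiguous views, shuffled in place).
    for i in range(m.shape[0]):
        rng.shuffle(m[i])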