numpy elementwise outer product with sparse matrices - python

I want to do the element-wise outer product of three (or four) large 2D arrays in python (values are float32 rounded to 2 decimals). They all have the same number of rows "n", but different number of columns "i", "j", "k".
The resulting array should be of shape (n, i*j*k). Then, I want to sum each column of the result to end up with a 1D array of shape (i*j*k).
np.shape(a) = (75466, 10)
np.shape(b) = (75466, 28)
np.shape(c) = (75466, 66)
np.shape(intermediate_result) = (75466, 18480)
np.shape(result) = (18480,)
Thanks to ruankesi and divakar, I got a piece of code that works:
# Multiply first two matrices
first_multi = a[...,None] * b[:,None]
# could use np.einsum('ij,ik->ijk',a,b), which is slightly faster
ab_fills = first_multi.reshape(a.shape[0], a.shape[1]*b.shape[1])
# Multiply the result with the third matrix
second_multi = ab_fills[..., None] * c[:,None]
abc_fills = second_multi.reshape(ab_fills.shape[0], ab_fills.shape[1] * c.shape[1])
# Get the result: sum columns and get a 1D array of length 10*28*66 = 18,480
result = np.sum(abc_fills, axis = 0)
Problem 1: Performance
This takes about 3 seconds, but I have to repeat this operation many times, and some of the matrices are even larger (in number of rows). It is acceptable, but making it faster would be nice.
Problem 2: My matrices are sparse
For instance, "a" contains 70% zeros. I tried to play with scipy's csc_matrix, but really could not get a working version. (To get the element-wise outer product here, I go through a conversion to a 3D array, which is not supported by scipy's sparse matrices.)
Problem 3: memory usage
If I try to also work with a 4th matrix, I run into memory issues.
I imagine that converting this code to sparse matrices would save a lot of memory and make the calculation faster by ignoring the numerous zero values.
Is that true? If yes, can someone help me?
Of course, if you have any suggestion for a better implementation, I am also very interested. I don't need any of the intermediate results, just the final 1D result.
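For illustration, here is a rough, untested sketch of the kind of column-by-column sparse reduction I have in mind (the function and variable names are just illustrative; this is the sort of thing I could not get to work cleanly):
import numpy as np
from scipy import sparse

def sparse_col_outer_sum(a, b, c):
    # illustrative sketch: for each column j of sparse "a", only its nonzero rows
    # contribute to result[j, :, :], so restrict b and c to those rows
    a_csc = sparse.csc_matrix(a)
    out = np.zeros((a_csc.shape[1], b.shape[1], c.shape[1]), dtype=np.float64)
    for j in range(a_csc.shape[1]):
        start, stop = a_csc.indptr[j], a_csc.indptr[j + 1]
        rows = a_csc.indices[start:stop]
        vals = a_csc.data[start:stop]
        # out[j, k, l] = sum over nonzero rows i of a[i, j] * b[i, k] * c[i, l]
        out[j] = (b[rows] * vals[:, None]).T @ c[rows]
    return out.ravel()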
I've been stuck on this part of the code for weeks, and I'm going nuts!
Thank you!
Edit after Divakar's answer
Approach #1:
Very nice one-liner, but surprisingly slower than the original approach (?).
On my test dataset, approach #1 takes 4.98 s ± 3.06 ms per loop (no speedup with optimize = True)
The original decomposed approach took 3.01 s ± 16.5 ms per loop
Approach #2:
Absolutely great, thank you! What an impressive speedup!
62.6 ms ± 233 µs per loop
About numexpr: I try to avoid external module requirements as much as possible, and I don't plan to use multiple cores/threads. This is an "embarrassingly" parallelizable task with hundreds of thousands of objects to analyze, so I'll just spread the list across the available CPUs in production. I will give it a try for memory optimization.
As a brief test of numexpr restricted to one thread, performing a single multiplication, I get a runtime of 40 ms without numexpr and 52 ms with it.
Thanks again!!

Approach #1
We can use np.einsum to do sum-reductions in one go -
result = np.einsum('ij,ik,il->jkl',a,b,c).ravel()
Also, play around with the optimize flag in np.einsum by setting it to True to use BLAS.
Approach #2
We can use broadcasting to do the first step, as also mentioned in the posted code, and then leverage tensor matrix multiplication with np.tensordot -
def broadcast_dot(a, b, c):
    first_multi = a[..., None] * b[:, None]
    return np.tensordot(first_multi, c, axes=(0, 0)).ravel()
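As a quick sanity check (a small illustrative snippet, assuming the a, b, c and result arrays from the question are in scope), the raveled output matches the original decomposed approach:
# illustrative check against the question's decomposed computation
assert np.allclose(broadcast_dot(a, b, c), result)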
We can also use the numexpr module, which supports multi-core processing and achieves better memory efficiency, to compute first_multi. This gives us a modified solution, like so -
import numexpr as ne
def numexpr_broadcast_dot(a, b, c):
    first_multi = ne.evaluate('A*B', {'A': a[..., None], 'B': b[:, None]})
    return np.tensordot(first_multi, c, axes=(0, 0)).ravel()
Timings on random float data with given dataset sizes -
In [36]: %timeit np.einsum('ij,ik,il->jkl',a,b,c).ravel()
4.57 s ± 75.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit broadcast_dot(a,b,c)
270 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit numexpr_broadcast_dot(a,b,c)
172 ms ± 63.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Just to give a sense of improvement with numexpr -
In [7]: %timeit a[...,None] * b[:,None]
80.4 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit ne.evaluate('A*B',{'A':a[...,None],'B':b[:,None]})
25.9 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This should be substantial when extending this solution to a higher number of inputs.
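Should a fourth input be needed, here is an untested sketch of how the same pattern might extend (the broadcasting handles the first three arrays, exactly as the question's intermediate does, and np.tensordot contracts the row axis with the fourth, so the largest intermediate stays at shape (n, i*j*k)):
def broadcast_dot_4(a, b, c, d):
    # broadcast-multiply a, b, c (same intermediate as in the question),
    # then contract the shared row axis with d in one tensordot call
    n = a.shape[0]
    ab = (a[..., None] * b[:, None]).reshape(n, -1)
    abc = (ab[..., None] * c[:, None]).reshape(n, -1)
    return np.tensordot(abc, d, axes=(0, 0)).ravel()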

Related

Numba in nonpython mode is much slower than pure python (no print statements or specified numpy functions)

I have recently discovered that Numba can be much slower than pure Python, even in nopython mode with the parallel=True option enabled.
Important: if you don't deal with Voronoi diagrams, please keep reading; my question doesn't relate to them directly.
Currently, I am working on a problem where I have energies associated with the Voronoi diagram's edges and cell areas. scipy's Voronoi returns an array containing the pairs of points (vor.ridge_points) associated with each Voronoi edge. For my code, I want to be able to get the index of an edge given the indices of the associated points, so I define a kind of adjacency matrix, but instead of ones and zeros it holds zeros and edge indices.
It turns out that pure Python looping over numpy arrays ends up 10 times faster than Numba. Here is a toy example (I just randomly generated arrays with the same number of edges and points as in my simulation).
My guess is that it has something to do with memory allocation. Any take on the subject would be appreciated (either the reason why it is so much slower, or a better way to get the edge number from the point numbers) :)
# %%
from numba.np.ufunc import parallel
import numpy as np
from numba import njit
from numba import prange
# %% generating an array that models the array of ridges
points_number = 8802
ridges_number = 26379
np.random.seed(123)
ridge_points = np.random.randint(points_number, size=(ridges_number, 2))
# %% symmetric matrix containing indexes of all edges
# in space [original_point_1, original_point_2]
ridge_points = np.array(ridge_points, dtype=np.int32)
@njit(parallel=True, cache=True)
def jit_edges_matrix_op(r_p, r_n):
    matrix = np.zeros((r_n, r_n), dtype=np.int32)
    for i in prange(r_n):
        e1 = r_p[i, 0]
        e2 = r_p[i, 1]
        matrix[e1, e2] = i
        matrix[e2, e1] = i
    return matrix
e_matrix_op = jit_edges_matrix_op(ridge_points, ridges_number)
# %% the same but not jitted
def edges_matrix_op(r_p, r_n):
    matrix = np.zeros((r_n, r_n), dtype=np.int32)
    for i in range(r_n):
        e1 = r_p[i, 0]
        e2 = r_p[i, 1]
        matrix[e1, e2] = i
        matrix[e2, e1] = i
    return matrix
e_matrix_op = edges_matrix_op(ridge_points, ridges_number)
# %%
%%timeit
jit_edges_matrix_op(ridge_points, ridges_number)
# %%
%%timeit
edges_matrix_op(ridge_points, ridges_number)
UPDATE
Indeed, parallelization is not working properly here, so I ran tests with parallel=False as well. Here are the results:
630 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - parallel=True
553 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - parallel=False
66.5 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) - pure python
UPDATE 2
Thanks to max9111 for sharing a link: https://github.com/numba/numba/issues/7259
There seems to be an issue with allocating large arrays of zeros (np.zeros).
The issue was reported a couple of weeks ago, and the link contains some workaround examples.
I tried allocating with np.empty() instead:
29.7 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- numba parallel=True
44.7 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- numba parallel=False
60.4 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- pure python
And as you can see, parallelized Numba works best, so this task is parallelizable and the overhead is not that big.
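For reference, a sketch of what such an np.empty-based variant might look like (illustrative only, not necessarily the exact workaround from the issue; the zeroing is moved into a prange loop instead of calling np.zeros):
import numpy as np
from numba import njit, prange

@njit(parallel=True, cache=True)
def jit_edges_matrix_empty(r_p, r_n):
    # allocate without initialization, then zero the buffer in a parallel loop
    matrix = np.empty((r_n, r_n), dtype=np.int32)
    for i in prange(r_n):
        matrix[i, :] = 0
    # fill in the edge indices as before
    for i in prange(r_p.shape[0]):
        e1 = r_p[i, 0]
        e2 = r_p[i, 1]
        matrix[e1, e2] = i
        matrix[e2, e1] = i
    return matrix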
I think the crucial issue is the nature of the calculations you are performing in this part of the code:
for i in range(ridges_number):
    e1 = r_p[i, 0]
    e2 = r_p[i, 1]
    matrix[e1, e2] = i
    matrix[e2, e1] = i
Loops perform best when they are cache-local and trivially parallelizable (i.e. the calculations in each iteration are independent).
In your case, both conditions are violated. e1 and e2 do not take consecutive values across iterations, so the writes are scattered through memory. Similarly, the output matrix likely prevents efficient parallelization, because it needs to be accessed by all of the threads in every iteration and is probably locked by one thread while being accessed by the others.
All in all, the function you chose to speed up may suffer the overhead of parallelization while in effect the calculations are executed sequentially. And the calculations, at least as they are at the moment, are inherently difficult to speed up by parallelization.
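As a side note on the second part of the question (a sketch, not a benchmarked recommendation): if the goal is only to look up an edge index from a pair of point indices, a plain Python dictionary keyed by the sorted point pair avoids allocating the large dense matrix altogether:
def edges_index_map(r_p):
    # map (smaller point index, larger point index) -> edge index
    return {(min(p1, p2), max(p1, p2)): i for i, (p1, p2) in enumerate(r_p)}

# hypothetical usage: edges_index_map(ridge_points)[(min(e1, e2), max(e1, e2))]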

Numpy - most efficient way to create an array from a list of arrays

I have a program whose current performance bottleneck involves creating a new array from a relatively short list of relatively long, flat arrays:
num_arrays = 5
array_length = 1000
arrays = [np.random.random((array_length, )) for _ in range(num_arrays)]
new_array = np.array(arrays)
In other words, stacking n arrays of shape (s,) into new_array of shape (n, s).
I am looking for the most efficient way to compute this, since this operation is repeated millions of times.
I tested the performance of the two trivial ways to do this:
%timeit np.array(arrays)
>>> 3.6 µs ± 67.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.stack(arrays)
>>> 9.61 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I am currently using np.array(arrays), but I am wondering if there is a more efficient way to do this.
Some details which might help:
The length of the arrays is fixed throughout the runtime of the program, e.g. 1000.
The number of arrays is low, usually <= 5. It is possible to get an upper bound for it at checkpoints throughout the run of the program (i.e. every ~1000 creations of such arrays), but not in advance.
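Given those details, one direction I have been considering (an untested sketch, not something I have measured) is to reuse a preallocated buffer sized to the known upper bound and copy into it, instead of letting np.array allocate a fresh block on every call:
import numpy as np

def stack_into(buffer, arrays):
    # copy each 1D array into a row of a preallocated 2D buffer;
    # the result is a view into the buffer and is overwritten on the next call
    for i, arr in enumerate(arrays):
        buffer[i] = arr
    return buffer[:len(arrays)]

buf = np.empty((5, 1000))                      # known upper bound x fixed length
arrays = [np.random.random(1000) for _ in range(5)]
new_array = stack_into(buf, arrays)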

Is there a list of big O complexities for the numpy library?

I'm doing a time complexity analysis of an algorithm and need to know what kind of complexities certain numpy operations have.
For some, I assume they match the underlying mathematical operation. Like np.dot(array1, array2) would be O(n). For others, I am not as sure. For example, is np.array(my_array) O(1)? or is it O(n)? Does it simply reassign a pointer or is it iterating over the list and copying out each value?
I want to be sure of each operation's complexity. Is there somewhere I can find this information? Or should I just assume they match the mathematical operation?
Big O complexity is not often used with Python and numpy. It's a measure of how the code scales with problem size. That's useful in a compiled language like C. But here the code is a mix of interpreted Python and compiled code. Both can have the same big O, but the interpreted version will be orders of magnitude slower. That's why most of the SO questions about improving numpy speed talk about 'removing loops' and 'vectorizing'.
Also, few operations are purely O(n); most are a mix. There's a setup cost plus a per-element cost. If the per-element cost is small, the setup cost dominates.
If starting with lists, it's often faster to iterate on the list, because converting a list to an array has a substantial overhead (O(n)).
If you already have arrays, then avoid (python level) iteration where possible. Iteration is part of most calculations, but numpy lets you do a lot of that in faster compiled code (faster O(n)).
At some point you have to understand how numpy stores its arrays. The distinction between view and copy is important. A view is in effect O(1), a copy O(n).
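A small illustration of that distinction (illustrative only):
import numpy as np

a = np.arange(1_000_000)
v = a[::2]           # a view: no data copied, effectively O(1)
c = a[::2].copy()    # a copy: every selected element is written, O(n)
v[0] = -1            # writes through to a, since the memory is shared
assert a[0] == -1 and c[0] == 0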
Often you'll see SO answers do timeit speed comparisons. I often add the caution that results might vary with problem size. The better answers will time various size problems, and show the results on a nice plot. The results are often a mix of straight lines (O(n)), and curves (varying blends of O(1) and O(n) components).
You asked specifically about np.array. Here are some sample timings:
In [134]: %%timeit alist = list(range(1000))
...: np.array(alist)
67.9 µs ± 839 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [135]: %%timeit alist = list(range(10))
...: np.array(alist)
2.19 µs ± 9.88 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [136]: %%timeit alist = list(range(2000))
...: np.array(alist)
134 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
copy an array:
In [137]: %%timeit alist = list(range(2000)); arr=np.array(alist)
...: np.array(arr)
1.77 µs ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
no copy:
In [138]: %%timeit alist = list(range(2000)); arr=np.array(alist)
...: np.array(arr, copy=False)
237 ns ± 1.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
from a list of strings:
In [139]: %%timeit alist = [str(i) for i in range(2000)]
...: np.array(alist, dtype=int)
286 µs ± 4.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Almost all calculations in numpy are O(n). If an operation involves each element of an array, its speed will depend on the size of the array. Some array manipulations are O(1), such as reshaping, because they don't actually do anything with the data; they change properties like shape and strides.
Search problems often grow faster than O(n); usually numpy is not the best choice for that kind of problem. Smart use of Python lists and dictionaries can be faster.
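For example (a small illustration, not a benchmark), repeated membership tests are a case where a Python set, with O(1) average lookups, beats rescanning an array each time:
import numpy as np

values = np.random.randint(0, 10_000_000, size=1_000_000)
lookup = set(values.tolist())      # built once, O(n)
print(42 in lookup)                # hash lookup, average O(1) per query
print(bool(np.isin(42, values)))   # scans the whole array, O(n) per query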
For the specific example np.array(my_array): since it needs to run through all the elements of my_array, allocate memory, and initialize the values, it runs in linear time.
There is a Python module, big_O, that can be used to estimate the complexity of a function empirically from its execution time; see its documentation for more information.

Why doesn't numpy.zeros allocate all of its memory on creation? And how can I force it to?

I want to create an empty NumPy array in Python and fill it with values later. The code below generates a 1024x1024x1024 array of 2-byte integers, which means it should take at least 2 GB of RAM.
>>> import numpy as np; from sys import getsizeof
>>> A = np.zeros((1024,1024,1024), dtype=np.int16)
>>> getsizeof(A)
2147483776
From getsizeof(A), we see that the array takes 2^31 + 128 bytes (the 128 presumably being header information). However, in my task manager I can see that Python is only taking 18.7 MiB of memory.
Suspecting the array was being compressed, I assigned random values to each memory slot so that it could not be:
>>> for i in range(1024):
...     for j in range(1024):
...         for k in range(1024):
...             A[i,j,k] = np.random.randint(32767, dtype=np.int16)
The loop is still running, and my RAM is slowly increasing (presumably as the arrays composing A inflate with the incompressible noise). I'm assuming it would make my code faster to force numpy to allocate this array fully from the beginning. Curiously, I haven't seen this behavior documented anywhere!
So, 1. Why does numpy do this? and 2. How can I force numpy to allocate memory?
A neat answer to your first question can also be found in this StackOverflow answer.
To answer your second question, you can force the memory to be allocated as follows in a more or less efficient manner:
A = np.empty((1024,1024,1024), dtype=np.int16)
A.fill(0)
because then the memory is touched.
On my machine with my setup,
A = np.empty(0)
A.resize((1024, 1024, 1024))
also does the trick, but I cannot find this behavior documented, and this might be an implementation detail; realloc is used under the hood in numpy.
Let's look at some timings for a smaller case:
In [107]: A = np.zeros(10000,int)
In [108]: for i in range(A.shape[0]): A[i]=np.random.randint(327676)
We don't need to make A 3d to get the same effect; 1d of the same total size would be just as good.
In [109]: timeit for i in range(A.shape[0]): A[i]=np.random.randint(327676)
37 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now compare that time to the alternative of generating the random numbers with one call:
In [110]: timeit np.random.randint(327676, size=A.shape)
185 µs ± 905 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Much much faster.
If we do the same loop, but simply assign the random number to a variable (and throw it away):
In [111]: timeit for i in range(A.shape[0]): x=np.random.randint(327676)
32.3 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The times are nearly the same as the original case. Assigning the values to the zeros array is not the big time consumer.
I'm not testing a case as large as yours, and my A has already been initialized in full. So you are welcome to repeat the comparisons with your sizes. But I think the pattern will still hold - iterating 1024x1024x1024 times (100,000 times larger than my example) is the big time consumer, not the memory allocation task.
Something else you might experiment with: just iterate on the first dimension of A, and assign random integer arrays shaped like the remaining dimensions. For example, expanding my A with a size-10 dimension:
In [112]: A = np.zeros((10,10000),int)
In [113]: timeit for i in range(A.shape[0]): A[i]=np.random.randint(327676,size=A.shape[1])
1.95 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A is 10x larger than in [107], but takes roughly 19x less time to fill, because it only has to iterate 10 times. In numpy, if you must iterate, try to do it as few times as possible, with a more complex task in each iteration.
(timeit repeats the test many times (e.g. 7*10), so it isn't going to capture any initial memory allocation step, even if I use a large enough array for that to matter).
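As an aside (a sketch along those lines, using the bounds from the question), the whole array can be generated in one vectorized call, which both forces the memory to be backed by real pages and avoids the triple Python loop entirely:
import numpy as np

# one call allocates, fills, and therefore touches all 2 GB of the array
A = np.random.randint(32767, size=(1024, 1024, 1024), dtype=np.int16)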

Why is Numpy faster at creating 3D arrays than 2D arrays?

From timing the creation of Nx4096x4096 arrays, it appears Numpy does it much faster when N = 2 or 3 than N = 1:
import numpy as np
%timeit a = np.zeros((2, 4096, 4096), dtype=np.float32, order='C')
5.24 µs ± 98.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a = np.zeros((4096, 4096), dtype=np.float32, order='C')
23.4 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The difference is shocking. Why is that so and how to make the case when N = 1 at least as fast as when N > 1? Could the "%timeit" be simply wrong for timing this?
Context: I need to create another single array of 4096x4096 with a different type (uint8), and I'm trying to get the fastest Pythonic (or Numpy-related) implementation. The Nx4096x4096 array will be populated with non-zero values from a 3-column array (read from a file), where the 1st column holds 1D coordinates and the 2nd and 3rd columns are the intensity values for the 1st and 2nd image (hence the N=2 case). Using sparse matrices is for now not an option.
There are 130 million of such files. So the above is happening as many times.
[EDIT] This is under Python 3.6.4 and numpy 1.14 on macOS Sierra. The same versions under Windows do not reproduce the same behavior: there, np.zeros() for the smaller array takes half the time of the twice-as-large array. From the comments and the mentioned duplicate question I understand this can be due to thresholds in memory allocation. This does, however, defeat the purpose of %timeit.
[EDIT 2] Regarding the duplicate question, the question here should now be more about how to time this function properly, without having to write extra code that accesses the variable so that the OS actually allocates the memory. Wouldn't that extra code bias the timing result? Isn't there a simple way to profile this?
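One workaround I can think of (a sketch, not a definitive methodology): time the allocation together with a write to the whole array, so the OS must back every page with physical memory before the timer stops; the extra write biases the absolute numbers upward, but it biases both shapes in the same way:
import numpy as np
from timeit import timeit

def zeros_touched(shape):
    a = np.zeros(shape, dtype=np.float32, order='C')
    a.fill(0)   # force a real write so every page is actually allocated
    return a

print(timeit(lambda: zeros_touched((4096, 4096)), number=10) / 10)
print(timeit(lambda: zeros_touched((2, 4096, 4096)), number=10) / 10)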
