I have three NumPy arrays of ints, same number of columns, arbitrary number of rows each. I am interested in all instances where a row of the first one plus a row of the second one gives a row of the third one ([3, 1, 4] + [1, 5, 9] = [4, 6, 13]).
Here is pseudo-code:

for i, j in rows(array1), rows(array2):
    if i + j is in rows(array3):
        somehow store the rows this occurred at (e.g. (1, 2, 5) if the 1st row of
        array1 + the 2nd row of array2 gives the 5th row of array3)
I will need to run this for very big matrices, so I have two questions:
(1) I can write the above using nested loops, but is there a quicker way, perhaps list comprehensions or itertools?
(2) What is the fastest/most memory-efficient way to store the triples? Later I will need to create a heatmap using two as coordinates and the first one as the corresponding value, e.g. point (2, 5) has value 1 in the pseudo-code example.
I would be very grateful for any tips; I know this sounds quite simple, but it needs to run fast and I have very little experience with optimization.
Edit: my ugly code was requested in the comments.
import numpy as np

# random arrays
A = np.array([[-1, 0], [0, -1], [4, 1], [-1, 2]])
B = np.array([[1, 2], [0, 3], [3, 1]])
C = np.array([[0, 2], [2, 3]])

# triples stored as numbers with 2 coordinates in an otherwise-zero matrix
output_matrix = np.zeros((B.shape[0], C.shape[0]), dtype=int)
for i in range(A.shape[0]):
    for j in range(B.shape[0]):
        for k in range(C.shape[0]):
            if np.array_equal(A[i] + B[j], C[k]):
                output_matrix[j, k] = i + 1
print(output_matrix)
We can leverage broadcasting to perform all those summations and comparisons in a vectorized manner, then use np.where on the result to get the indices of the matches, and finally index and assign:
output_matrix = np.zeros((B.shape[0], C.shape[0]), dtype=int)

# mask[i, j, k] is True when A[i] + B[j] == C[k], element-wise
mask = ((A[:, None, None, :] + B[None, :, None, :]) == C).all(-1)
I, J, K = np.where(mask)
output_matrix[J, K] = I + 1
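With the sample arrays from the question, this reproduces the loop version's output (the values below are simply the result of running the snippet above). One caveat: the broadcast comparison array has shape (len(A), len(B), len(C), ncols) before the .all(-1) reduction, so memory grows with the product of all three row counts.

print(output_matrix)
# [[1 0]
#  [2 0]
#  [0 4]]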
(1) Improvements
You can use a set-style lookup for the rows of the third array, as a + b = c must hold identically. This already replaces one nested loop with a constant-time lookup. I will show you an example of how to do this below, but we first ought to introduce some notation.
For a set-based approach to work, we need a hashable type. Lists will thus not work, but a tuple will: it is an ordered, immutable structure. There is, however, a problem: tuple addition is defined as concatenation, that is,
(0, 1) + (1, 0) = (0, 1, 1, 0).
This will not do for our use-case: we need element-wise addition. As such, we subclass the built-in tuple as follows,
class AdditionTuple(tuple):

    def __add__(self, other):
        """Element-wise addition."""
        if len(self) != len(other):
            raise ValueError("Undefined behaviour!")

        return AdditionTuple(self[idx] + other[idx]
                             for idx in range(len(self)))
Where we override the default behaviour of __add__. Now that we have a data-type amenable to our problem, let's prepare the data.
You give us,
A = [[-1, 0], [0, -1], [4, 1], [-1, 2]]
B = [[1, 2], [0, 3], [3, 1]]
C = [[0, 2], [2, 3]]
To work with. I say,
from types import SimpleNamespace

A = [AdditionTuple(item) for item in A]
B = [AdditionTuple(item) for item in B]
C = {tuple(item): SimpleNamespace(idx=idx, values=[])
     for idx, item in enumerate(C)}
That is, we modify A and B to use our new data-type, and turn C into a dictionary which supports (amortised) O(1) look-up times.
We can now do the following, eliminating one loop altogether,
from itertools import product

for a, b in product(enumerate(A), enumerate(B)):
    idx_a, a_i = a
    idx_b, b_j = b

    if a_i + b_j in C:  # a_i + b_j == c_k, identically
        C[a_i + b_j].values.append((idx_a, idx_b))
Then,
>>> print(C)
{(2, 3): namespace(idx=1, values=[(3, 2)]), (0, 2): namespace(idx=0, values=[(0, 0), (1, 1)])}
Where for each value in C, you get the index of that value (as idx), and a list of tuples of (idx_a, idx_b) whose elements of A and B together sum to the value at idx in C.
Let us briefly analyse the complexity of this algorithm. Redefining the lists A, B, and C as above is linear in the length of the lists. Iterating over A and B is of course in O(|A| * |B|), and the nested condition computes the element-wise addition of the tuples: this is linear in the length of the tuples themselves, which we shall denote k. The whole algorithm then runs in O(k * |A| * |B|).
This is a substantial improvement over your current O(k * |A| * |B| * |C|) algorithm.
(2) Matrix plotting
Use a dok_matrix, a sparse SciPy matrix representation. Then you can use any heatmap-plotting library you like on the matrix, e.g. Seaborn's heatmap.
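For instance, here is a minimal sketch, assuming the dict C built in part (1) above and the question's convention of storing idx_a + 1 at heatmap point (idx_b, idx_c); the seaborn/matplotlib calls are the standard ones, not something from your post:

from scipy.sparse import dok_matrix
import seaborn as sns
import matplotlib.pyplot as plt

heat = dok_matrix((len(B), len(C)), dtype=int)  # |B| rows, |C| columns
for ns in C.values():
    for idx_a, idx_b in ns.values:
        heat[idx_b, ns.idx] = idx_a + 1  # value at heatmap point (idx_b, idx_c)

sns.heatmap(heat.toarray(), annot=True)
plt.show()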
Related
I have a Numpy array and a list of indices whose values I would like to increment by one. This list may contain repeated indices, and I would like the increment to scale with the number of repeats of each index. Without repeats, the command is simple:
a = np.zeros(6).astype('int')
b = [3, 2, 5]
a[b] += 1
With repeats, I've come up with the following method.
b = [3, 2, 5, 2]                 # indices to increment by one each replicate
bbins = np.bincount(b)
b.sort()                         # sort b because bincount is sorted
incr = bbins[np.nonzero(bbins)]  # create increment array
bu = np.unique(b)                # sorted, unique indices (len(bu) == len(incr))
a[bu] += incr
Is this the best way? Is there a risk involved in assuming that the np.bincount and np.unique operations would result in the same sorted order? Am I missing some simple NumPy operation to solve this?
In numpy >= 1.8, you can also use the at method of the addition 'universal function' ('ufunc'). As the docs note:
For addition ufunc, this method is equivalent to a[indices] += b, except that results are accumulated for elements that are indexed more than once.
So taking your example:
a = np.zeros(6).astype('int')
b = [3, 2, 5, 2]
…to then…
np.add.at(a, b, 1)
…will leave a as…
array([0, 0, 2, 1, 0, 1])
After you do
bbins=np.bincount(b)
why not do:
a[:len(bbins)] += bbins
(Edited for further simplification.)
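For example, with the arrays from the question (a quick sketch, assuming every index in b is a valid index into a, so that len(bbins) <= len(a)):

import numpy as np

a = np.zeros(6, dtype=int)
b = [3, 2, 5, 2]
bbins = np.bincount(b)  # [0, 0, 2, 1, 0, 1]
a[:len(bbins)] += bbins
print(a)                # [0 0 2 1 0 1]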
If b is a small subrange of a, one can refine Alok's answer like this:
import numpy as np

a = np.zeros(100000, int)
b = np.array([99999, 99997, 99999])
blo, bhi = b.min(), b.max()
bbins = np.bincount(b - blo)
a[blo:bhi+1] += bbins
print(a[blo:bhi+1])  # [1 0 2]
I have a question about how to "call" (index) a specific cell in an array while looping over another array.
Assume there is an array a:
a = [[a1 a2 a3],[b1 b2 b3]]
and an array b:
b = [[c1 c2] , [d1 d2]]
Now, I want to recalculate the values in array b by using the information from array a. In detail, each value of array b has to be recalculated by multiplying it with the integral of the Gauss function between the borders given in array a. But for the sake of simplicity, let's forget about the integral and assume a simple calculation is necessary, of the form:

c1 = c1 * (a2 - a1); c2 = c2 * (a3 - a2), and so on.

With indices, it might look like:

b[i, j] = b[i, j] * (a[i, j+1] - a[i, j])
Can anybody tell me how to solve this problem?
Thank you very much and best regards,
Marc
You can use the zip function within a nested list comprehension:
>>> [[k*(v[1]-v[0]) for k,v in zip(v,zip(s,s[1:]))] for s,v in zip(a,b)]
zip(s, s[1:]) will give you the desired pairs of elements, for example:

>>> s = [4, 5, 6]
>>> list(zip(s, s[1:]))
[(4, 5), (5, 6)]

Demo:
>>> b =[[7, 8], [6, 0]]
>>> a = [[1,5,3],[4 ,0 ,6]]
>>> [[k*(v[1]-v[0]) for k,v in zip(v,zip(s,s[1:]))] for s,v in zip(a,b)]
[[28, -16], [-24, 0]]
You can also do this really cleanly with NumPy:
import numpy as np
a, b = np.array(a), np.array(b)
np.diff(a) * b
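As a quick check (not from the original post), the same demo data as above reproduces the list-comprehension result:

import numpy as np

a = np.array([[1, 5, 3], [4, 0, 6]])
b = np.array([[7, 8], [6, 0]])
print(np.diff(a) * b)
# [[ 28 -16]
#  [-24   0]]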
First, I would split your a table into a table of lower bounds and one of upper bounds, to work with aligned tables and improve readability:
lowerBounds = a[...,:-1]
upperBounds = a[...,1:]
Define the Gauss function you provided :
import numpy
import scipy.integrate

def f(x, gs_wdth=1., mean=0.):
    # standard Gaussian density (note the division by gs_wdth in the
    # normalisation factor)
    return (1. / (numpy.sqrt(2*numpy.pi) * gs_wdth)
            * numpy.exp(-(x - mean)**2 / (2*gs_wdth**2)))
Then, use an nditer (see Iterating Over Arrays) to efficiently iterate over the arrays:
it = numpy.nditer([b, lowerBounds, upperBounds],
                  op_flags=[['readwrite'], ['readonly'], ['readonly']])
for _b, _lb, _ub in it:
    # weight each element of b by the integral of f over [_lb, _ub]
    multiplier = scipy.integrate.quad(f, _lb, _ub)[0]
    _b[...] *= multiplier
print(b)
This does the job required in your post and should be computationally efficient. Note that b is modified in place: the original values are lost, but there is no memory overshoot during the calculation.
I have a generator g which I know in advance will return n items. Each item i has the following structure:
t_i:(e_i, b_i)
t_i is a tuple of variable size and may contain any ordered subsequence of the list (1, ..., n). For example, for n=6, t_1=(1, 3, 4), t_2=(2, 4, 6), and so on.
e_i is a number (float/integer), and b_i is a boolean (which is not really used here).
I wonder what is the most efficient way to construct an n x n matrix (using a numpy array) from g such that:

Each row i of the matrix corresponds to t_i:(e_i, b_i) in such a way that: 1. the row elements (in the matrix) whose positions appear in t_i are set using e_i; 2. the other row elements default to 0.

So, for example, given that row 2 of an 8 x 8 matrix corresponds to item t_2:(e_2, b_2) = (2, 4, 6):(13, True), this row should then be set as (0, 13, 0, 13, 0, 13, 0, 0). Notice that we are not using zero-indexing here for the numbers in t_2 (or t_i in general).

An obvious way is to construct the n x n matrix in advance and then go through each item returned by the generator, setting each row sequentially based on the item. But I feel there must be a more efficient way to do this, given the power of Python and that of numpy in particular.
Constructing an n-by-n matrix in Numpy is easy and fairly efficient. By using advanced indexing to set the rows, we can get a pretty simple and efficient implementation:
arr = np.zeros((n, n))
for i, (t, e, b) in enumerate(g):
    arr[i, np.array(t) - 1] = e  # t is 1-indexed, hence the -1
Note this assumes that g produces tuples of the form (ti, ei, bi).
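For example, here is a minimal self-contained sketch with a hypothetical generator; the items are made up for illustration and follow the (t_i, e_i, b_i) form assumed above:

import numpy as np

n = 4
g = iter([((1, 3), 2.0, True),     # items of the form (t_i, e_i, b_i)
          ((2, 4), 5.0, False),
          ((1, 2, 3), 1.0, True),
          ((4,), 7.0, False)])

arr = np.zeros((n, n))
for i, (t, e, b) in enumerate(g):
    arr[i, np.array(t) - 1] = e    # t_i is 1-indexed, hence the -1

print(arr)
# [[2. 0. 2. 0.]
#  [0. 5. 0. 5.]
#  [1. 1. 1. 0.]
#  [0. 0. 0. 7.]]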
Say that I have 4 numpy arrays
[1,2,3]
[2,3,1]
[3,2,1]
[1,3,2]
In this case, I've determined [1,2,3] is the "minimum array" for my purposes, as it is one of two arrays with the lowest value at index 0, and of those two arrays it has the lowest value at index 1. If there were more arrays with similar values, I would need to compare the next index values, and so on.
How can I extract the array [1,2,3] in that same order from the pile?
How can I extend that to x arrays of size n?
Thanks
Using the plain Python (non-numpy) .sort() or sorted() on a list of lists (not numpy arrays) automatically does this, e.g.
a = [[1,2,3],[2,3,1],[3,2,1],[1,3,2]]
a.sort()
gives
[[1,2,3],[1,3,2],[2,3,1],[3,2,1]]
numpy's sort only sorts along an axis (within each sub-array), so the best way seems to be converting to a Python list first. Assuming you have an array of arrays you want to pick the minimum of, you could get the minimum as
sorted(a.tolist())[0]
As someone pointed out, you could also do min(a.tolist()), which uses the same type of comparisons as sort and would be faster for large arrays (linear vs. n log n asymptotic run time).
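A quick demonstration with the arrays from the question:

import numpy as np

a = np.array([[1, 2, 3], [2, 3, 1], [3, 2, 1], [1, 3, 2]])
print(sorted(a.tolist())[0])  # [1, 2, 3]
print(min(a.tolist()))        # [1, 2, 3]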
Here's an idea using numpy:
import numpy

a = numpy.array([[1, 2, 3], [2, 3, 1], [3, 2, 1], [1, 3, 2]])

col = 0
while a.shape[0] > 1:
    b = numpy.argmin(a[:, col:], axis=1)
    a = a[b == numpy.min(b)]
    col += 1
print(a)
This checks column by column until only one row is left.
numpy's lexsort is close to what you want. It sorts on the last key first, but that's easy to get around:
>>> a = np.array([[1,2,3],[2,3,1],[3,2,1],[1,3,2]])
>>> order = np.lexsort(a[:, ::-1].T)
>>> order
array([0, 3, 1, 2])
>>> a[order]
array([[1, 2, 3],
[1, 3, 2],
[2, 3, 1],
[3, 2, 1]])
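If you only need the minimum row rather than the full ordering, the first entry of order gives it directly:

>>> a[order[0]]
array([1, 2, 3])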
I have two 1D numpy arrays of unequal lengths. I want to make pairs (array1_element, array2_element) of the elements which are close to each other. Let's consider the following example:
a = [1,2,3,8,20,23]
b = [1,2,3,5,7,21,35]
The expected result is
[(1,1),
(2,2),
(3,3),
(8,7),
(20,21),
(23,25)]
It is important to note that 5 is left alone. This could easily be done with loops, but I have very large arrays. I considered using a nearest-neighbour search, but that felt like killing a sparrow with a cannon.
Can anybody please suggest an elegant solution?
Thanks a lot.
How about using the Needleman-Wunsch algorithm? :)
The scoring matrix would be trivial, as the "distance" between two numbers is just their difference.
But that will probably feel like killing a sparrow with a tank ...
You could use the built-in map function to vectorize a function that does this. For example:
ar1 = np.array([1, 2, 3, 8, 20, 23])
ar2 = np.array([1, 2, 3, 5, 7, 21, 35])

def closest(ar1, ar2, idx):
    x = np.abs(ar1[idx] - ar2)  # distance from ar1[idx] to every element of ar2
    index = np.where(x == x.min())
    value = ar2[index]
    return value

def find(x):
    return closest(ar1, ar2, x)

c = np.array(list(map(find, range(ar1.shape[0]))))
In the example above, it looked like you wanted to exclude values once they had been paired. In that case, you could include a removal process in the first function like this, but be very careful about how array 1 is sorted:
def closest(ar1, ar2, idx):
    x = np.abs(ar1[idx] - ar2)
    index = np.where(x == x.min())
    value = ar2[index]
    ar2[ar2 == value] = -10000000  # mark the matched value as used
    return value
The best method I can think of is to use a loop. If loops in Python are too slow, you can use Cython to speed up your code.
I think one can do it like this:

- create two new structured arrays, such that there is a second field which is 0 or 1, indicating to which array the value belongs, i.e. the key;
- concatenate both arrays;
- sort the united array along the first field (the values);
- use two stacks: go through the array putting elements with key 1 on the left stack, and when you cross an element with key 0, put it in the right stack. When you reach the second element with key 0, check the top and bottom of the left and right stacks for the first key-0 element and take the closest value (maybe with a maximum distance), then switch stacks and continue.

The sort should be the slowest step, and the maximum total space for the stacks is n or m.
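Here is a minimal sketch of the tag-concatenate-sort part of this idea (my own illustration, using the question's arrays); instead of the literal two-stack bookkeeping it compares each key-0 element with its nearest key-1 neighbour on either side in the sorted merge, which yields the same closest value:

import numpy as np

a = np.array([1, 2, 3, 8, 20, 23])     # key 0: values to pair up
b = np.array([1, 2, 3, 5, 7, 21, 35])  # key 1: candidate partners

# tag each value with the array it came from, concatenate, and sort by value
values = np.concatenate([a, b])
keys = np.concatenate([np.zeros(len(a), int), np.ones(len(b), int)])
order = np.argsort(values, kind='stable')
values, keys = values[order], keys[order]

# for each key-0 element, the closest key-1 value is the nearest key-1
# neighbour to its left or right in the sorted merge
pos_b = np.flatnonzero(keys == 1)
pairs = []
for i in np.flatnonzero(keys == 0):
    left = pos_b[pos_b < i]
    right = pos_b[pos_b > i]
    candidates = []
    if left.size:
        candidates.append(int(values[left[-1]]))
    if right.size:
        candidates.append(int(values[right[0]]))
    pairs.append((int(values[i]),
                  min(candidates, key=lambda v: abs(v - values[i]))))

print(pairs)  # [(1, 1), (2, 2), (3, 3), (8, 7), (20, 21), (23, 21)]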
You can do the following:
a = np.array([1, 2, 3, 8, 20, 23])
b = np.array([1, 2, 3, 5, 7, 21, 25])

def find_closest(a, sorted_b):
    # index of the midpoint interval each element of a falls into
    j = np.searchsorted(.5 * (sorted_b[1:] + sorted_b[:-1]), a, side='right')
    return sorted_b[j]

b.sort()  # or, b = np.sort(b), if you don't want to modify b in-place
print(np.c_[a, find_closest(a, b)])
# ->
# [[ 1  1]
#  [ 2  2]
#  [ 3  3]
#  [ 8  7]
#  [20 21]
#  [23 25]]
This should be pretty fast. It works because, for each number in a, searchsorted finds the index of the midpoint interval that number falls into, i.e., the index of the closest number in b.
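To see the mechanics with the arrays above (b already sorted): any value of a falling to the left of a midpoint is closer to the lower neighbour, to the right of it closer to the upper one:

midpoints = .5 * (b[1:] + b[:-1])
print(midpoints)  # [ 1.5  2.5  4.   6.  14.  23. ]
print(np.searchsorted(midpoints, 8, side='right'))  # -> 4, and b[4] == 7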