Grouping elements of a NumPy array by sum of indices - python

I have several large numpy arrays of dimensions 30*30*30. For each array I need to traverse it, get the sum of each index triplet, and bin the elements by that sum. For example, consider this simple 2*2 array:
test = np.array([[2,3],[0,1]])
This array has the indices [0,0], [0,1], [1,0] and [1,1]. The routine would return the list [2,[3,0],1], because 2 in array test has index sum 0, 3 and 0 have index sum 1, and 1 has index sum 2. I know the brute-force method of iterating through the NumPy array and checking the sum would work, but it is far too inefficient for my actual case with large N (=30) and several arrays. Any input on using NumPy routines to accomplish this grouping would be appreciated. Thank you in advance.

Here is one way that should be reasonably fast, but not super-fast: 30x30x30 takes 20 ms on my machine.
import numpy as np
# make example
dims = 2,3,4
a = np.arange(np.prod(dims),0,-1).reshape(dims)
# create and sort indices
idx = sum(np.ogrid[tuple(map(slice,dims))])
srt = idx.ravel().argsort(kind='stable')
# use order to arrange and split data
asrt = a.ravel()[srt]
spltpts = idx.ravel().searchsorted(np.arange(1,np.sum(dims)-len(dims)+1),sorter=srt)
out = np.split(asrt,spltpts)
# admire
out
# [array([24]), array([23, 20, 12]), array([22, 19, 16, 11, 8]), array([21, 18, 15, 10, 7, 4]), array([17, 14, 9, 6, 3]), array([13, 5, 2]), array([1])]
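For instance, wrapping the same steps in a small helper (the function name group_by_index_sum is mine, not part of the original answer) and applying it to the 2x2 test array from the question reproduces the grouping asked for:
import numpy as np
def group_by_index_sum(a):
    # index-sum of every element, same shape as a
    idx = sum(np.ogrid[tuple(map(slice, a.shape))])
    srt = idx.ravel().argsort(kind='stable')
    # split points: first occurrence of each index-sum value 1..max
    spltpts = idx.ravel().searchsorted(np.arange(1, np.sum(a.shape) - a.ndim + 1), sorter=srt)
    return np.split(a.ravel()[srt], spltpts)
test = np.array([[2, 3], [0, 1]])
print(group_by_index_sum(test))
# [array([2]), array([3, 0]), array([1])]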

You could procedurally create a list of index tuples and use that, but you may end up with a constant that is too large to be efficient.
[[(0,0)],[(1,0),(0,1)],[(1,1)]]
So you need a function to generate these indices on the fly for an n-dimensional array.
For one dimension, it is a trivial count/increment:
[(0),(1),(2),...]
For the second dimension, use the one-dimension strategy for the first dimension, then decrement the first and increment the second to fill in:
[(0...)...,(1...)...,(2...)...,...]
[[(0,0)],[(1,0),(0,1)],[(2,0),(1,1),(0,2)],[...],...]
Notice that some of these would fall outside the example array; your generator would need to include a bounds check.
For three dimensions, give the first two dimensions the same treatment as above, but at the end decrement the first dimension, increment the third, and repeat until done:
[[(0,0,0),...],[(1,0,0),(0,1,0),...],[(2,0,0),(1,1,0),(0,2,0),...],[...],...]
[[(0,0,0)],[(1,0,0),(0,1,0),(0,0,1)],[(2,0,0),(1,1,0),(0,2,0),(1,0,1),(0,1,1),(0,0,2)],...]
Again, you need bounds checks or cleverer start/end points to avoid accessing outside the array, but this general algorithm is how you would generate the indices on the fly rather than having two large arrays compete for cache and I/O.
Generating the Python or NumPy equivalent is left as an exercise to the user.
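A minimal sketch of such a generator (the function name and the recursive formulation are mine; the bounds checks are the ones described above):
import numpy as np
def indices_with_sum(s, dims):
    # yield every in-bounds index tuple of shape dims whose components sum to s
    if len(dims) == 1:
        if 0 <= s < dims[0]:  # bounds check on the last dimension
            yield (s,)
        return
    # the first coordinate can be at most s and at most dims[0]-1
    for first in range(min(s, dims[0] - 1), -1, -1):
        for rest in indices_with_sum(s - first, dims[1:]):
            yield (first,) + rest
a = np.arange(24).reshape(2, 3, 4)
groups = [[a[ix] for ix in indices_with_sum(s, a.shape)]
          for s in range(sum(a.shape) - a.ndim + 1)]
Whether this beats the vectorized answer above depends on how much the index arrays compete for cache in the real workload.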


Mapping between ids and indices using numpy arrays

I'm working on a graphical application that uses shapes such as quads, trias, lines etc. to represent geometry.
The input data is based on ID's.
A list of points is provided, each with an ID and coordinates (x, y, z)
A list of shapes is provided, each defined using the ids from the list of points
So a tria is defined as N1, N2, N3 where the N's are ID's in the list of points
I'm using VTK to display the data and it uses indices and not ids.
So I have to convert the id-based input to index-based input, and I use the following numpy array approach, which works REALLY well (and was provided by someone on this board, I think):
# nodes - numpy array of point ids
nodes=np.asarray([1, 15, 56, 101, 150]) # This array can be millions of ids long
nmx = nodes.max()
node_id_to_index = np.empty((nmx + 1,), dtype=np.uint32)
# Using the node id as an index, insert consecutive indices as values into the array
# This gives us an array that can be indexed by ID and return the compact index
node_id_to_index[nodes] = np.arange(len(nodes), dtype=np.uint32)
Now when I have a shape defined using ids I can easily convert it to use indices like this
elems_by_id = np.asarray([56,1,150,15,101,1]) # This array can be millions of ids long
elems_by_index = node_id_to_index[elems_by_id]
# gives [2, 0, 4, 1, 3, 0]
One weakness of the approach is that if the original list of ids contains even one VERY large number, I'm required to allocate an array big enough to hold that many items, even though I may not have that many entries in the original id list (the original ID list can have gaps in the ids). I ran into this condition today.
So my question is - how can I modify this approach to handle lists that contain ids so large that I don't have enough memory to create the mapping array?
Any help will be gratefully received....
Doug
OK - I think I found a solution - credit to @Paul Panzer.
But first some addition info - the input nodes array is sorted and guaranteed to have only unique ids
elems_by_index = nodes.searchsorted(elems_by_id)
This is only marginally slower than the original approach, so I'll just branch based on the max id in nodes: use the original approach when I can easily allocate enough memory, and the searchsorted approach when the max id is huge.
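A small sketch of that branching idea (the function name build_lookup and the threshold value are mine, purely illustrative):
import numpy as np
def build_lookup(nodes, max_dense_id=10_000_000):
    # nodes must be sorted and contain unique ids
    nmx = nodes.max()
    if nmx <= max_dense_id:
        # cheap to allocate: use the dense lookup table
        table = np.empty(nmx + 1, dtype=np.uint32)
        table[nodes] = np.arange(len(nodes), dtype=np.uint32)
        return lambda ids: table[ids]
    # max id too large: fall back to binary search
    return lambda ids: nodes.searchsorted(ids)
nodes = np.asarray([1, 15, 56, 101, 150])
lookup = build_lookup(nodes)
print(lookup(np.asarray([56, 1, 150, 15, 101, 1])))  # [2 0 4 1 3 0]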
As I understand, essentially you're looking for a fast way to find the index of a number in a list, e. g., you have a list like:
nodes = [932, 578, 41, ...]
and need a structure that would give
id_to_index[932] == 0
id_to_index[578] == 1
id_to_index[41] == 2
# etc.
(which can be done with something as simple as nodes.index(932), except that wouldn't be fast at all). And your current solution is, essentially, the first part of the pigeonhole sort algorithm, and the problem is that the "number of elements (n) and the length of the range of possible key values (N) are approximately the same" condition isn't met - the range is much bigger in your case, so too much memory is wasted on that auxiliary data structure.
Why not simply use a Python dictionary, by the way? E.g. id_to_index = {932: 0, 578: 1, 41: 2, ...} - is it too slow (your current lookup is O(1); a dictionary lookup is also O(1) on average, just with more per-element overhead)? Or is it because you want numpy indexing (e.g. id_to_index[[n1, n2, n3]] instead of one by one)? Perhaps, then, you can use SciPy sparse matrices (a single-row matrix instead of an array):
import numpy as np
import scipy.sparse as sp
nodes = np.array([9, 2, 7]) # a small test sample
# your solution with a numpy array
nmx = nodes.max()
node_id_to_index = np.empty((nmx + 1,), dtype=np.uint32)
node_id_to_index[nodes] = np.arange(len(nodes), dtype=np.uint32)
elems_by_id = [7, 2] # an even smaller test sample
elems_by_index = node_id_to_index[elems_by_id]
# gives [2 1]
print(elems_by_index)
# same with a 1 by (nmx + 1) sparse matrix
m = sp.csr_matrix((1, nmx + 1), dtype = np.uint32) # 1 x 10 matrix, stores nothing
m[0, nodes] = np.arange(len(nodes), dtype=np.uint32) # 1 x 10 matrix, stores 3 elements
m_by_index = m[0, elems_by_id] # looking through the 0-th row
print(m_by_index.toarray()[0]) # also gives [2 1]
Not sure if I chose the optimal type of matrix for this; read the descriptions of the different sparse matrix formats to find the best one for the task.
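For completeness, the plain-dictionary variant mentioned above would look roughly like this (a minimal sketch; whether it is fast enough for millions of lookups is something to measure):
import numpy as np
nodes = np.asarray([1, 15, 56, 101, 150])
id_to_index = {int(n): i for i, n in enumerate(nodes)}  # id -> compact index
elems_by_id = [56, 1, 150, 15, 101, 1]
elems_by_index = np.array([id_to_index[n] for n in elems_by_id])
print(elems_by_index)  # [2 0 4 1 3 0]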

Trying to rank two arrays at the same time

I am trying to get a rank of worst case scenarios but there are two different types of worst case scenarios that I am trying to compare. So I have two separate arrays and I am trying to compare them to one another.
I used the sorting method from the link below and it works when sorting with one array but not with two.
Rank items in an array using Python/NumPy, without sorting array twice
CI_SUM_1 = numpy.array([2,1,7,23])
CI_SUM_2 = numpy.array([4,0,22,3])
order = CI_SUM_1.argsort() + CI_SUM_2.argsort()
rank = order.argsort()
print(rank)
In the above example it is adding the ranks together (which makes sense), so I am getting [0,2,1,3], which isn't what I am looking for. I am trying to get 8 ranks so I can see individual ranks.
The expected result should be something like [2,1,5,7,4,0,6,3], which is the ranking when putting the two arrays side by side. Basically what I want is an absolute rank, not a rank per array. So I only want one 1, unless two values are the same. I don't want two arrays ranked 0-3; I want one ranked 0-7.
You need to concatenate the two arrays CI_SUM_1 and CI_SUM_2 before using argsort, for example:
print (np.concatenate([CI_SUM_1,CI_SUM_2]).argsort().argsort())
array([2, 1, 5, 7, 4, 0, 6, 3], dtype=int64)
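If you then want to see which of those ranks belongs to which original array, you can split the combined result back in two (a small follow-up sketch):
import numpy as np
CI_SUM_1 = np.array([2, 1, 7, 23])
CI_SUM_2 = np.array([4, 0, 22, 3])
rank = np.concatenate([CI_SUM_1, CI_SUM_2]).argsort().argsort()
rank_1, rank_2 = rank[:len(CI_SUM_1)], rank[len(CI_SUM_1):]
print(rank_1)  # [2 1 5 7]
print(rank_2)  # [4 0 6 3]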

How to check if all elements of a numpy array are in another numpy array

I have two 2D numpy arrays, for example:
A = numpy.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
and
B = numpy.array([[1, 2], [32, 32]])
I want all rows from A that contain all elements of any of the rows of B. Where a row of B contains the same element twice, the matching rows of A must contain it at least twice as well. In the case of my example, I want to achieve this:
A_filtered = [[1, 2, 4, 8], [16, 32, 32, 8]]
I have control over the value representation, so I chose numbers whose binary representation has only one bit set (for example 0b00000001, 0b00000010, etc.). This way I can easily check whether all types of values are present in a row by using the np.logical_or.reduce() function, but I cannot check that the count of repeated elements in a row of A is greater than or equal to the count in B. I was really hoping to avoid a plain for loop and deep copies of the arrays, as performance is a very important aspect for me.
How can I do that in numpy in an efficient way?
Update:
A solution from here may work, but performance is a big concern for me: A can be really big (>300000 rows) and B moderate (>30 rows):
[set(row).issuperset(hand) for row in A.tolist() for hand in B.tolist()]
Update 2:
The set() solution is not working since the set() drops all duplicated values.
I hope I got your question right; at least it works with the problem you described in your question. If the order of the output has to stay the same as the input, replace the in-place sort with sorting a copy.
The code looks quite ugly, but should perform well and shouldn't be too hard to understand.
Code
import time
import numba as nb
import numpy as np
@nb.njit(fastmath=True,parallel=True)
def filter(A,B):
    # A and B must be sorted along their rows before calling (see below)
    iFilter=np.zeros(A.shape[0],dtype=nb.bool_)
    for i in nb.prange(A.shape[0]):
        break_loop=False
        for j in range(B.shape[0]):
            ind_to_B=0
            # walk the sorted row of A, matching the sorted row of B element by element
            for k in range(A.shape[1]):
                if A[i,k]==B[j,ind_to_B]:
                    ind_to_B+=1
                if ind_to_B==B.shape[1]:
                    iFilter[i]=True
                    break_loop=True
                    break
            if break_loop==True:
                break
    return A[iFilter,:]
Measuring performance
####First call has some compilation overhead####
A=np.random.randint(low=0, high=60, size=300_000*4).reshape(300_000,4)
B=np.random.randint(low=0, high=60, size=30*2).reshape(30,2)
t1=time.time()
#At first sort the arrays
A.sort()
B.sort()
A_filtered=filter(A,B)
print(time.time()-t1)
####Let's measure the second call too####
A=np.random.randint(low=0, high=60, size=300_000*4).reshape(300_000,4)
B=np.random.randint(low=0, high=60, size=30*2).reshape(30,2)
t1=time.time()
#At first sort the arrays
A.sort()
B.sort()
A_filtered=filter(A,B)
print(time.time()-t1)
Results
46ms after the first run on a dual-core Notebook (sorting included)
32ms (sorting excluded)
I think this should work:
First, encode the data as follows (this assumes a limited number of 'tokens', as your binary scheme also seems to imply):
Give A the shape [n_rows, n_tokens] with dtype int8, where each element counts how many times that token occurs in the row. Encode B the same way, with shape [n_hands, n_tokens].
This allows for a single vectorized expression of your output: matches = (A[None, :, :] >= B[:, None, :]).all(axis=-1). (Exactly how to map this matches array to your desired output format is left as an exercise to the reader, since the question leaves it undefined for multiple matches.)
But we are talking >10 Mbyte of memory per token here. Even with 32 tokens that should not be unthinkable; but in a situation like this it tends to be better not to vectorize the loop over n_tokens or n_hands, or both; for loops are fine for small n, or if there is sufficient work to be done in the body so that the looping overhead is insignificant.
As long as n_tokens and n_hands remain moderate, I think this will be the fastest solution, if staying within the realm of pure python and numpy.
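A minimal sketch of that encoding, under the assumptions that the "tokens" are simply the distinct values of A and that every value in B also occurs in A (the helper name count_encode is mine):
import numpy as np
A = np.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
B = np.array([[1, 2], [32, 32]])
tokens = np.unique(A)  # one column per distinct value
def count_encode(X, tokens):
    # rows -> per-token counts, shape [n_rows, n_tokens]
    counts = np.zeros((X.shape[0], len(tokens)), dtype=np.int8)
    token_ids = np.searchsorted(tokens, X)
    np.add.at(counts, (np.arange(X.shape[0])[:, None], token_ids), 1)
    return counts
A_counts = count_encode(A, tokens)
B_counts = count_encode(B, tokens)
# row i of A matches hand j of B iff it has at least as many of every token
matches = (A_counts[None, :, :] >= B_counts[:, None, :]).all(axis=-1)
A_filtered = A[matches.any(axis=0)]
print(A_filtered)  # rows [1 2 4 8] and [16 32 32 8]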

Python: return the row index of the minimum in a matrix

I want to print the index of the row containing the minimum element of the matrix.
My matrix is matrix = [[22,33,44,55],[22,3,4,12],[34,6,4,5,8,2]]
and the code is:
matrix = [[22,33,44,55],[22,3,4,12],[34,6,4,5,8,2]]
a = np.array(matrix)
buff_min = matrix.argmin(axis = 0)
print(buff_min) #index of the row containing the minimum element
min = np.array(matrix[buff_min])
print(str(min.min(axis=0))) #print the minium of that row
print(min.argmin(axis = 0)) #index of the minimum
print(matrix[buff_min]) # print all row containing the minimum
after running, my result is
1
3
1
[22, 3, 4, 12]
The first number should be 2, because the minimum is 2 and it is in the third list ([34,6,4,5,8,2]), but it returns 1. It also returns 3 as the minimum of the matrix.
What's the error?
I am not sure which version of Python you are using; I tested this with Python 2.7 and 3.2. As mentioned, your syntax for argmin is not correct; it should be in the format
import numpy as np
np.argmin(array_name,axis)
Next, while NumPy knows about arrays of arbitrary objects, it is optimized for homogeneous arrays of numbers with fixed dimensions. If you really need arrays of arrays, it is better to use a nested list. But depending on the intended use of your data, different data structures might be even better, e.g. a masked array if you have some invalid data points.
If you really want flexible Numpy arrays, use something like this:
np.array([[22,33,44,55],[22,3,4,12],[34,6,4,5,8,2]], dtype=object)
However this will create a one-dimensional array that stores references to lists, which means that you will lose most of the benefits of Numpy (vector processing, locality, slicing, etc.).
Also, if you can resize your numpy array (so that all rows have the same length), things might work; I haven't tested it, but conceptually that should be an easy solution. Still, I would prefer a nested list for this kind of input matrix.
Does this work?
np.where(a == a.min())[0][0]
Note that all rows of the matrix need to contain the same number of elements.
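For illustration, on a rectangular version of the example (the ragged last row trimmed to four elements, purely as an assumption so that the matrix is valid), that expression does give the row index of the overall minimum:
import numpy as np
a = np.array([[22, 33, 44, 55], [22, 3, 4, 12], [34, 6, 4, 5]])
print(np.where(a == a.min())[0][0])              # 1 -- the minimum, 3, sits in the second row
print(np.unravel_index(a.argmin(), a.shape)[0])  # equivalent alternative, also 1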

Applying a probabilistic function over specific element of an array individually with Numpy/Python

I'm fairly new to Python/Numpy. What I have here is a standard array and I have a function which I have vectorized appropriately.
def f(i):
    return np.random.choice(2,1,p=[0.7,0.3])*9
f = np.vectorize(f)
Defining an example array:
array = np.array([[1,1,0],[0,1,0],[0,0,1]])
With the vectorized function f, I would like to evaluate f on each cell of the array that has a value of 0.
I am trying to leave for loops as a last resort. My arrays will eventually be larger than 100 by 100, so evaluating f on each cell individually might take too long.
I have tried:
print f(array[array==0])
Unfortunately, this gives me a row array consisting of 5 elements (the zeroes in my original array).
Alternatively I have tried,
array[array==0] = f(1)
But as expected, this just turns every single zero element of array into 0's or 9's.
What I'm looking for is a way to get back my original array with the zero elements replaced individually. Ideally, about 30% of my original zero elements will become 9 and the array structure is preserved.
Thanks
The reason your first try doesn't work is that the vectorized function handle (let's call it f_v to distinguish it from the original f) performs the operation for exactly 5 elements: the 5 elements that are returned by the boolean indexing operation array[array==0]. That expression returns 5 values; it doesn't set those 5 items to the returned values. Your analysis of why the second form fails is spot-on.
If you wanted to solve it you could combine your second approach with adding the size option to np.random.choice:
array = np.array([[1,1,0],[0,1,0],[0,0,1]])
mask = array==0
array[mask] = np.random.choice([18,9], size=mask.sum(), p=[0.7, 0.3])
# example output:
# array([[ 1, 1, 9],
# [18, 1, 9],
# [ 9, 18, 1]])
There was no need for np.vectorize: the size option takes care of that already.
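If the intent is literally the 0-or-9 behaviour described in the question (presumably the 18 in the example above just makes the replaced cells easy to spot), the same pattern with the question's own values would be:
array[mask] = np.random.choice([0, 9], size=mask.sum(), p=[0.7, 0.3])
# each zero independently stays 0 with probability 0.7 or becomes 9 with probability 0.3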
