I'm working on a graphical application that uses shapes such as quads, trias, lines etc. to represent geometry.
The input data is based on IDs.
A list of points is provided, each with an ID and coordinates (x, y, z).
A list of shapes is provided, each defined using the IDs from the list of points.
So a tria is defined as N1, N2, N3, where the N's are IDs in the list of points.
I'm using VTK to display the data, and VTK uses indices, not IDs.
So I have to convert the ID-based input to index-based input, and I use the following numpy array approach, which works REALLY well (and was provided by someone on this board, I think):
import numpy as np

# nodes - numpy array of point IDs
nodes = np.asarray([1, 15, 56, 101, 150])  # This array can be millions of IDs long
nmx = nodes.max()
node_id_to_index = np.empty((nmx + 1,), dtype=np.uint32)
# Using the node ID as an index, insert consecutive indices as values into the array.
# This gives us an array that can be indexed by ID and return the compact index.
node_id_to_index[nodes] = np.arange(len(nodes), dtype=np.uint32)
Now when I have a shape defined using IDs, I can easily convert it to use indices like this:
elems_by_id = np.asarray([56,1,150,15,101,1]) # This array can be millions of ids long
elems_by_index = node_id_to_index[elems_by_id]
# gives [2, 0, 4, 1, 3, 0]
One weakness of the approach is that if the original list of IDs contains even one VERY large number, I'm required to allocate an array big enough to be indexed by that number, even though I may have far fewer entries in the original ID list (the ID list can have gaps). I ran into this condition today.....
So my question is - how can I modify this approach to handle lists that contain IDs so large that I don't have enough memory to create the mapping array?
Any help will be gratefully received....
Doug
OK - I think I found a solution - credit to @Paul Panzer.
But first some additional info - the input nodes array is sorted and guaranteed to have only unique IDs:
elems_by_index = nodes.searchsorted(elems_by_id)
This is only marginally slower than the original approach, so I'll just branch based on the max ID in nodes: use the original approach when I can easily allocate enough memory, and the searchsorted approach when the max ID is huge....
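Roughly, the branch might look like this (the cutoff value and the function name here are just placeholders for the idea, not part of my actual code):
import numpy as np

# Arbitrary cutoff: above this max ID the dense lookup table is deemed too big.
MAX_DENSE_ID = 50_000_000

def ids_to_indices(nodes, elems_by_id):
    """Map element node IDs to compact indices into `nodes` (sorted, unique)."""
    if nodes.max() <= MAX_DENSE_ID:
        # Dense lookup table, as in the original approach.
        lut = np.empty(nodes.max() + 1, dtype=np.uint32)
        lut[nodes] = np.arange(len(nodes), dtype=np.uint32)
        return lut[elems_by_id]
    # IDs too large for a dense table: fall back to binary search.
    return nodes.searchsorted(elems_by_id)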
As I understand it, essentially you're looking for a fast way to find the index of a number in a list, e.g. you have a list like:
nodes = [932, 578, 41, ...]
and need a structure that would give
id_to_index[932] == 0
id_to_index[578] == 1
id_to_index[41] == 2
# etc.
(which can be done with something as simple as nodes.index(932), except that wouldn't be fast at all). And your current solution is, essentially, the first part of the pigeonhole sort algorithm, and the problem is that the "number of elements (n) and the length of the range of possible key values (N) are approximately the same" condition isn't met - the range is much bigger in your case, so too much memory is wasted on that auxiliary data structure.
Why not simply use a Python dictionary, by the way? E.g. id_to_index = {932: 0, 578: 1, 41: 2, ...} - is it too slow (your current lookup is plain array indexing; a dictionary lookup is also O(1) on average, but with more per-element overhead)? Or is it because you want numpy indexing (e.g. id_to_index[[n1, n2, n3]] instead of one by one)? Perhaps, then, you can use SciPy sparse matrices (a single-row matrix instead of an array):
import numpy as np
import scipy.sparse as sp
nodes = np.array([9, 2, 7]) # a small test sample
# your solution with a numpy array
nmx = nodes.max()
node_id_to_index = np.empty((nmx + 1,), dtype=np.uint32)
node_id_to_index[nodes] = np.arange(len(nodes), dtype=np.uint32)
elems_by_id = [7, 2] # an even smaller test sample
elems_by_index = node_id_to_index[elems_by_id]
# gives [2 1]
print(elems_by_index)
# same with a 1 by (nmx + 1) sparse matrix
m = sp.csr_matrix((1, nmx + 1), dtype = np.uint32) # 1 x 10 matrix, stores nothing
m[0, nodes] = np.arange(len(nodes), dtype=np.uint32) # 1 x 10 matrix, stores 3 elements
m_by_index = m[0, elems_by_id] # looking through the 0-th row
print(m_by_index.toarray()[0]) # also gives [2 1]
I'm not sure I chose the optimal type of matrix for this; read the descriptions of the different sparse matrix formats to find the best one for the task.
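For completeness, here is a rough sketch of the plain-dictionary alternative I mentioned above; it gives up numpy fancy indexing, so the lookup falls back to a Python-level comprehension:
import numpy as np

nodes = np.array([9, 2, 7])
elems_by_id = [7, 2]

# Build the mapping once...
id_to_index = {int(node_id): idx for idx, node_id in enumerate(nodes)}

# ...then look the IDs up one by one (no vectorized indexing here).
elems_by_index = np.array([id_to_index[i] for i in elems_by_id], dtype=np.uint32)
print(elems_by_index)  # [2 1]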
Related
I have a 3D int64 Numpy array, which is output from skimage.measure.label. I need a list of 3D indices that match each of our possible (previously known) values, separated out by which indices correspond to each value.
Currently, we do this by the following idiom:
for cur_idx, count in values_counts.items():
    region = labels[:, :, :] == cur_idx
    dim1_indices, dim2_indices, dim3_indices = np.nonzero(region)
While this code works and produces correct output, it is quite slow, especially the np.nonzero part, as we call this 200+ times on a large array. I realize that there is probably a faster way to do this via, say, numba, but we'd like to avoid adding on additional requirements unless needed.
Ultimately, what we're looking for is a list of indices that correspond to each (nonzero) value relatively efficiently. Assume our number of values <1000 but our array size >100x1000x1000. So, for example, on the array created by the following:
x = np.zeros((4,4,4))
x[3,3,3] = 1; x[1,0,3] = 2; x[1,2,3] = 2
we would want some idx_value dict/array such that idx_value_1[2] = 1 idx_value_2[2] = 2, idx_value_3[2] = 3.
I've tried tackling problems similar to the one you describe, and I think the np.argwhere function is probably your best option for reducing runtime (see the np.argwhere docs). See the code example below for how this could be used per the constraints you identify above.
import numpy as np
x = np.zeros((4,4,4))
x[3,3,3] = 1; x[1,0,3] = 2; x[1,2,3] = 3
# Instantiate dictionary/array to store indices
idx_value = {}
# Get the indices matching each value
idx_value[3] = np.argwhere(x == 3)
idx_value[2] = np.argwhere(x == 2)
idx_value[1] = np.argwhere(x == 1)
# Display idx_value - consistent with indices we set before
>>> idx_value
{3: array([[1, 2, 3]]), 2: array([[1, 0, 3]]), 1: array([[3, 3, 3]])}
For the first use case, I think you would still have to use a for loop to iterate over the values you're searching over, but it could be done as:
# Instantiate dictionary/array
idx_value = {}
# Now loop by incrementally adding key/value pairs
for cur_idx, count in values_counts.items():
    idx_value[cur_idx] = np.argwhere(labels == cur_idx)
NOTE: This incrementally creates a dictionary where each key is an idx to search for, and each value is a np.array object of shape (N_matches, 3).
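As a small usage sketch (using the example variables from above), each (N_matches, 3) array can be fed straight back into the source array as per-axis indices:
coords = idx_value[2]        # shape (N_matches, 3), one row per matching voxel
vals = x[tuple(coords.T)]    # same as x[coords[:, 0], coords[:, 1], coords[:, 2]]
print(vals)                  # [2.] - every entry equals the value we searched for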
I want to print the index of the row containing the minimum element of the matrix.
My matrix is matrix = [[22,33,44,55],[22,3,4,12],[34,6,4,5,8,2]]
and the code is:
matrix = [[22,33,44,55],[22,3,4,12],[34,6,4,5,8,2]]
a = np.array(matrix)
buff_min = matrix.argmin(axis = 0)
print(buff_min) #index of the row containing the minimum element
min = np.array(matrix[buff_min])
print(str(min.min(axis=0))) #print the minium of that row
print(min.argmin(axis = 0)) #index of the minimum
print(matrix[buff_min]) # print all row containing the minimum
after running, my result is
1
3
1
[22, 3, 4, 12]
The first number should be 2, because the minimum, 2, is in the third list ([34,6,4,5,8,2]), but it returns 1, and it reports 3 as the minimum of the matrix.
What's the error?
I am not sure which version of Python you are using; I tested it on Python 2.7 and 3.2. As mentioned, your syntax for argmin is not correct; it should be in the format:
import numpy as np
np.argmin(array_name,axis)
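For reference, a small illustration (on a rectangular array) of what the different argmin calls return:
import numpy as np

a = np.array([[22, 33, 44, 55],
              [22,  3,  4, 12]])
print(np.argmin(a))          # 5 - index of the minimum in the flattened array
print(np.argmin(a, axis=0))  # [0 1 1 1] - row of the minimum in each column
print(np.argmin(a, axis=1))  # [0 1] - column of the minimum in each row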
Next, although Numpy knows about arrays of arbitrary objects, it is optimized for homogeneous arrays of numbers with fixed dimensions. If you really need arrays of arrays, it's better to use a nested list. But depending on the intended use of your data, different data structures might be even better, e.g. a masked array if you have some invalid data points.
If you really want flexible Numpy arrays, use something like this:
np.array([[22,33,44,55],[22,3,4,12],[34,6,4,5,8,2]], dtype=object)
However this will create a one-dimensional array that stores references to lists, which means that you will lose most of the benefits of Numpy (vector processing, locality, slicing, etc.).
Also, if you can resize the rows of your numpy array so they all have the same length, things might work; I haven't tested it, but conceptually that should be an easy solution. I would still prefer a nested list for this kind of input matrix.
Does this work?
np.where(a == a.min())[0][0]
Note that all rows of the matrix need to contain the same number of elements.
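A quick illustration, assuming the rows have been made the same length (I trimmed the third row here just for the example, so the minimum differs from your original matrix):
import numpy as np

a = np.array([[22, 33, 44, 55],
              [22,  3,  4, 12],
              [34,  6,  4,  5]])
print(np.where(a == a.min())[0][0])  # 1 - row index of the overall minimum (3)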
I'm fairly new to Python/Numpy. What I have here is a standard array and I have a function which I have vectorized appropriately.
def f(i):
    return np.random.choice(2, 1, p=[0.7, 0.3]) * 9

f = np.vectorize(f)
Defining an example array:
array = np.array([[1,1,0],[0,1,0],[0,0,1]])
With the vectorized function, f, I would like to evaluate f on each cell on the array with a value of 0.
I am trying to leave for loops as a last resort. My arrays will eventually be larger than 100 by 100, so running each cell individually to look and evaluate f might take too long.
I have tried:
print f(array[array==0])
Unfortunately, this gives me a row array consisting of 5 elements (the zeroes in my original array).
Alternatively I have tried,
array[array==0] = f(1)
But as expected, this just turns every single zero element of array into 0's or 9's.
What I'm looking for is some way to get back my original array with the zero elements replaced individually. Ideally, 30% of my original zero elements will become 9 and the array structure is conserved.
Thanks
The reason your first try doesn't work is because the vectorized function handle, let's call it f_v to distinguish it from the original f, is performing the operation for exactly 5 elements: the 5 elements that are returned by the boolean indexing operation array[array==0]. That returns 5 values, it doesn't set those 5 items to the returned values. Your analysis of why the 2nd form fails is spot-on.
If you wanted to solve it you could combine your second approach with adding the size option to np.random.choice:
array = np.array([[1,1,0],[0,1,0],[0,0,1]])
mask = array==0
array[mask] = np.random.choice([0, 9], size=mask.sum(), p=[0.7, 0.3])
# example output:
# array([[1, 1, 9],
#        [0, 1, 9],
#        [9, 0, 1]])
There was no need for np.vectorize: the size option takes care of that already.
I have a 2D array (array1), which has an arbitrary number of rows. In the first column I have strictly monotonically increasing numbers (but not linearly spaced), which represent a position in my system, while the second one gives me a value which represents the state of my system for and around the position in the first column.
Now I have a second array (array2); its range should usually be the same as for the first column of the first array, but that does not matter too much, as you will see below.
For every element in array2, I am now interested in:
1. What is the argument in array1[:,0], which has the closest value to the current element in array2?
2. What is the value (array1[:,1]) of those elements.
Since array2 will usually be longer than the number of rows in array1, it is perfectly fine if I get the same argument from array1 more than once. In fact this is what I expect.
The value from 2. is written in the second and third column, as you will see below.
My stripped-down code looks like this:
from numpy import arange, zeros, absolute, argmin, mod, newaxis, ones
ysize1 = 50
array1 = zeros((ysize1+1,2))
array1[:,0] = arange(ysize1+1)**2
# can be any strictly monotonic increasing array
array1[:,1] = mod(arange(ysize1+1),2)
# in my current case, but could also be something else
ysize2 = (ysize1)**2
array2 = zeros((ysize2+1,3))
array2[:,0] = arange(0,ysize2+1)
# is currently uniformly distributed over the whole range, but does not necessarily have to be
a = 0
for i, array2element in enumerate(array2[:, 0]):
    a = argmin(absolute(array1[:, 0] - array2element))
    array2[i, 1] = array1[a, 1]
It works, but takes quite a lot time to process large arrays. I then tried to implement broadcasting, which seems to work with the following code:
indexarray = argmin(absolute(ones(array2[:,0].shape[0])[:,newaxis]*array1[:,0]-array2[:,0][:,newaxis]),1)
array2[:,2]=array1[indexarray,1] # just to compare the results
Unfortunately now I seem to run into a different problem: I get a memory error from the line of code with the broadcasting.
For small sizes it works, but for larger ones, where len(array2[:,0]) is something like 2**17 (and could be even larger) and len(array1[:,0]) is about 2**14, I get an error saying that the size of the array is bigger than the available memory. Is there an elegant way around that, or to speed up the loop?
I do not need to store the intermediate array(s), I am just interested in the result.
Thanks!
First, let's simplify this line:
argmin(absolute(ones(array2[:,0].shape[0])[:,newaxis]*array1[:,0]-array2[:,0][:,newaxis]),1)
it should be:
a = array1[:, 0]
b = array2[:, 0]
argmin(abs(a - b[:, newaxis]), 1)
But even when simplified, you're creating two large temporary arrays. If a and b have sizes M and N, a - b[:, newaxis] and abs(...) each create a temporary array of size (N, M). Because you've said that a is monotonically increasing, you can avoid the issue altogether by using a binary search (sorted search), which is much faster anyway. Take a look at the answer I wrote to this question a while back. Using the function from that answer, try this:
closest = find_closest(array1[:, 0], array2[:, 0])
array2[:, 2] = array1[closest, 1]
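I haven't reproduced find_closest here; a typical searchsorted-based version (my reconstruction of the idea, not necessarily the exact code from that answer) looks like this:
import numpy as np

def find_closest(a, targets):
    """For each target, return the index of the closest value in the sorted 1-D array a."""
    targets = np.asarray(targets)
    idx = np.searchsorted(a, targets)   # insertion points, in 0..len(a)
    idx = np.clip(idx, 1, len(a) - 1)   # so idx - 1 and idx are both valid positions
    left, right = a[idx - 1], a[idx]
    # step back one position wherever the left neighbour is strictly closer
    idx -= targets - left < right - targets
    return idx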
In Python I have a numpy.ndarray called a and a list of indices called b. I want to get a list of all the values of a which are not within -10..10 places of the indices in b.
This is my current code, which takes a lot of time to run due to allocations of data (a is very big):
aa = a
# Remove all ranges backwards
for bb in b[::-1]:
    aa = np.delete(aa, range(bb - 10, bb + 10))
Is there a way to do it more efficiently? Preferably with few memory allocations.
np.delete will take an array of indices of any size. You can simply populate your entire array of indices and perform the delete once, therefore only deallocating and reallocating once. (Not tested, possible typos.)
bb = np.empty((len(b), 21), dtype=int)
for i, v in enumerate(b):
    bb[i] = v + np.arange(-10, 11)
result = np.delete(a, bb.flat)  # looks like .flat is optional
Note, if your ranges overlap, you'll get a difference between this and your algorithm: yours deletes from an array that has already shrunk on each pass, so it can end up removing items that were originally more than 10 indices away from an index in b.
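If the Python loop over b itself ever becomes noticeable, the same index table can also be built with broadcasting (a sketch, assuming b can be converted to an integer array; the same end-of-array caveats apply):
import numpy as np

offsets = np.arange(-10, 11)            # the same 21 offsets as above
bb = np.asarray(b)[:, None] + offsets   # shape (len(b), 21), one row per index in b
result = np.delete(a, bb.ravel())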
Could you find a certain number that you're sure will not be in a, and then set all indices around the b indices to that number, so that you can remove it afterwards?
import numpy as np
for i in range(-10, 11):
    a[b + i] = number_not_in_a
values = set(np.unique(a)) - set([number_not_in_a])
This code will not allocate new memory for a at all, needs only one range object created, and does the job in exactly 22 C-optimized numpy operations (well, 43 if you count the b + i operations), plus the cost of turning the unique return array into a set.
Beware, if b includes indices which are less than 10, the number_not_in_a "zone" around these indices will wrap around to the other end of the array. If b includes indices larger than len(a) - 11, the operation will fail with an IndexError at some point.
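Here is a runnable sketch of this idea that sidesteps both caveats by clipping the indices into the valid range (the sentinel choice is only an assumption; anything guaranteed absent from a will do):
import numpy as np

number_not_in_a = a.max() + 1   # assumes there is headroom above a's maximum value
b = np.asarray(b)
for i in range(-10, 11):
    # clip keeps the indices inside the array instead of wrapping or raising IndexError
    a[np.clip(b + i, 0, len(a) - 1)] = number_not_in_a
values = set(np.unique(a)) - {number_not_in_a}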