Suppose that I have a numpy.array of values, say
values = np.array([0, 3, 2, 4, 6])
and a numpy.array of indices, say
idces = np.array([1, 3, 5]).
I want to obtain an array which has a given value, say -1, in the positions of the idces, and the other elements distributed in the remaining locations. So in the case above I want to obtain
np.array([0, -1, 3, -1, 2, -1, 4, 6]).
This looks like a task for np.insert, except that np.insert inserts values before the given positions of the original array, rather than at the given positions of the result (and the two coincide only when there is a single index).
So the best I could come up with is
np.insert(values, idces - np.arange(len(idces)), -1).
This is still better than creating an array with -np.ones, computing the indices complementary to idces and then using np.put... but I was wondering: is there any cleaner way?
Insertion is best thought of in terms of offsets, which enumerate not the array elements but the gaps between (or before/after) them:
The documentation of np.insert describes its second argument as "the index or indices before which values is inserted", which is only approximately right: an offset can be equal to len(arr) (the end of the array) even though arr[len(arr)] raises an out-of-bounds IndexError.
For example, np.insert([3, 1, 4, 1, 5], [1, 3, 3, 5], [0, 0, 0, 0]) means: put one zero at the gap numbered 1, two others at the gap numbered 3, and the last one at the end. The result is [3, 0, 1, 4, 0, 0, 1, 5, 0].
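As a runnable check:

import numpy as np

print(np.insert([3, 1, 4, 1, 5], [1, 3, 3, 5], [0, 0, 0, 0]))
# [3 0 1 4 0 0 1 5 0]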
Some advantages of this enumeration over specifying post-insertion indices of new elements:
1) It's easier to insert a bunch of elements at one place: np.insert(arr, [3]*values.size, values) inserts the array values at the 3rd offset.
2) It's easier to interlace two arrays, with np.insert(arr, np.arange(values.size), values). (Both points are sketched after this list.)
3) It's easier to control whether an insertion point is valid; the validity does not depend on how many elements are being inserted.
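A quick sketch of points 1 and 2, with illustrative arrays:

import numpy as np

arr = np.array([10, 20, 30])
values = np.array([1, 2, 3])

# 1) Insert the whole `values` array at a single offset (3, i.e. the end here):
print(np.insert(arr, [3] * values.size, values))        # [10 20 30  1  2  3]

# 2) Interlace the two arrays:
print(np.insert(arr, np.arange(values.size), values))   # [ 1 10  2 20  3 30]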
The case when you know post-insertion indices idces is easy enough to handle, as you did with
np.insert(values, idces - np.arange(len(idces)), -1)
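For completeness, with the arrays from the question:

import numpy as np

values = np.array([0, 3, 2, 4, 6])
idces = np.array([1, 3, 5])
print(np.insert(values, idces - np.arange(len(idces)), -1))
# [ 0 -1  3 -1  2 -1  4  6]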
Related issue on NumPy tracker.
I have a snippet of code that looks like this:
def slice_table(table, index_vector):
    to_index_product = []
    array_indices = []
    for i, index in enumerate(index_vector):
        if isinstance(index, list):
            to_index_product.append(index)
            array_indices.append(i)
    index_product = np.ix_(*to_index_product)
    for i, multiple in enumerate(index_product):
        index_vector[array_indices[i]] = multiple
    index_vector = tuple(index_vector)
    sliced_table = table[index_vector]
    return sliced_table
table is an np.ndarray of shape (6, 7, 2, 2, 2, 11, 9).
The purpose of the function is to pick out values that satisfy all the given indices. Since advanced NumPy indexing pairs the given index arrays element by element and picks out individual values, instead of taking the desired intersections, I use np.ix_() to build open-mesh index arrays that let me extract entire dimension values rather than just separate values. My initial test slice worked as desired, so I was content with the code:
index_vector = [5, [1, 2], 1, 1, 1, [0, 3, 7], slice(0, 9, None)]
# The actual `index_vector` is code-generated, hence the usage of `slice()` object
sliced_table = slice_table(table, index_vector)
sliced_table.shape # (2, 3, 9)
In this example, every dimension except the 2nd, 6th and 7th gets an integer for an index and is thus absent from the slice. The shape of the slice can be read off the vector: the 2nd index is a list of 2 integers, the 6th a list of 3 integers, and the 7th a slice (meaning the entire length of that dimension is preserved). These examples also work:
index_vector = [5, [1, 2], 1, 1, 1, [0, 3, 7, 8], 1]
sliced_table = slice_table(table, index_vector)
sliced_table.shape # (2, 4)
index_vector = [5, [1, 2], 1, 1, 1, [0, 3, 7, 8], [1, 3]]
sliced_table = slice_table(table, index_vector)
sliced_table.shape # (2, 4, 2)
However, for the code below, the shape is not what I expect it to be:
index_vector = [
    slice(0, 6, None),
    [1, 2],
    slice(0, 2, None),
    slice(0, 2, None),
    slice(0, 2, None),
    [0, 3, 7, 8],
    1,
]
sliced_table = slice_table(table, index_vector)
sliced_table.shape # (2, 4, 6, 2, 2, 2)
The shape I want is (6, 2, 2, 2, 2, 4), but for some reason there's a reshuffling taking place and the shape is all wrong. It's a bit hard to say whether the elements are wrong too, because most of table is filled with None, but from the non-NoneType objects that I do get, it seems I get the desired elements (I don't see any undesired ones, that is), just reshaped for some reason.
Why does this happen? Maybe I don't correctly understand how np.ix_() works and I can't just build a product of array indices and extract the desired matrices for each dimension one by one, like I do in my function? Or is there something I don't get about NumPy indexing?
As @hpaulj mentioned, when advanced indices are separated by slices, the dimensions produced by the advanced indexing come first in the result, followed by the dimensions from the basic indices. Since slice objects trigger basic indexing, their dimensions are appended after the subspace made by the advanced indices. An excerpt from the docs:
The easiest way to understand a combination of multiple advanced
indices may be to think in terms of the resulting shape. There are two
parts to the indexing operation, the subspace defined by the basic
indexing (excluding integers) and the subspace from the advanced
indexing part. Two cases of index combination need to be
distinguished:
1) The advanced indices are separated by a slice, Ellipsis or newaxis. For example x[arr1, :, arr2].
2) The advanced indices are all next to each other. For example x[..., arr1, arr2, :] but not x[arr1, :, 1] since 1 is an advanced index in this regard.
In the first case, the dimensions resulting from the advanced indexing
operation come first in the result array, and the subspace dimensions
after that. In the second case, the dimensions from the advanced
indexing operations are inserted into the result array at the same
spot as they were in the initial array (the latter logic is what makes
simple advanced indexing behave just like slicing).
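A minimal illustration of the two cases on a small hypothetical array (the shapes follow from the rule quoted above):

import numpy as np

x = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
i = np.array([0, 1])  # an advanced index of length 2

# Case 1: the advanced indices are separated by a slice, so the
# advanced dimension (2,) moves to the front of the result.
print(x[i, :, i, :].shape)  # (2, 3, 5)

# Case 2: the advanced indices are next to each other, so the
# advanced dimension stays where those axes were.
print(x[:, i, i, :].shape)  # (2, 2, 5)

This mirrors the question: the slices in index_vector separate the advanced indices, triggering case 1, which is why the advanced-index dimensions (2, 4) end up in front of the sliced ones.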
I have a little bit of a tricky problem here...
Given two arrays A and B
A = np.array([8, 5, 3, 7])
B = np.array([5, 5, 7, 8, 3, 3, 3])
I would like to replace the values in B with the index of that value in A. In this example case, that would look like:
[1, 1, 3, 0, 2, 2, 2]
For the problem I'm working on, A and B contain the same set of values and all of the entries in A are unique.
The simple way to solve this is to use something like:
B_new = np.empty_like(B)  # will hold, for each element of B, its index in A
for idx in range(len(A)):
    ind = np.where(B == A[idx])[0]
    B_new[ind] = idx
But the B array I'm working with contains almost a million elements and using a for loop gets super slow. There must be a way to vectorize this, but I can't figure it out. The closest I've come is to do something like
np.intersect1d(A, B, return_indices=True)
But this only gives me the first occurrence of each element of A in B. Any suggestions?
The solution of @mozway is good for small arrays but not for big ones, as it runs in O(n**2) time (i.e. quadratic time; see time complexity for more information). Here is a much better solution for big arrays, running in O(n log n) time (i.e. quasi-linear), based on a fast binary search:
unique_values, index = np.unique(A, return_index=True)
result = index[np.searchsorted(unique_values, B)]
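With the arrays from the question:

import numpy as np

A = np.array([8, 5, 3, 7])
B = np.array([5, 5, 7, 8, 3, 3, 3])

unique_values, index = np.unique(A, return_index=True)  # [3 5 7 8], [2 1 3 0]
result = index[np.searchsorted(unique_values, B)]
print(result)  # [1 1 3 0 2 2 2]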
Use numpy broadcasting:
np.where(B[:, None]==A)[1]
NB. the values in A must be unique
Output:
array([1, 1, 3, 0, 2, 2, 2])
Though I can't tell exactly what the complexity of this is, I believe it will perform quite well: np.unique(B, return_inverse=True) gives each element of B its rank among the sorted unique values, and A.argsort() maps each rank back to its position in A:
A.argsort()[np.unique(B, return_inverse = True)[1]]
array([1, 1, 3, 0, 2, 2, 2], dtype=int64)
I have a numpy array:
foo = array([3, 1, 4, 0, 1, 0])
I want the top 3 items. Calling
foo.argsort()[::-1][:3]
returns
array([2, 0, 4])
Notice values foo[1] and foo[4] are equal, so numpy.argsort() handles the tie by returning the index of the item which appears last in the array; i.e. index 4.
For my application I want the tie breaking to return the index of the item which appears first in the array (index 1 here). How do I implement this efficiently?
What about simply this?
(-foo).argsort(kind='mergesort')[:3]
Why this works:
Argsorting in descending order (which np.argsort does not do) is the same as argsorting the negated values in ascending order (which is what np.argsort does). You then just need to pick the first 3 sorted indices. All that remains is to make sure the sort is stable, meaning that in case of ties the first index comes first.
NOTE: I thought the default kind='quicksort' was stable, but from the doc it appears only kind='mergesort' is guaranteed to be stable: https://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html
The various sorting algorithms are characterized by their average speed, worst case performance, work space size, and whether they are stable. A stable sort keeps items with the same key in the same relative order. The three available algorithms have the following properties:
kind speed worst case work space stable
‘quicksort’ 1 O(n^2) 0 no
‘mergesort’ 2 O(n*log(n)) ~n/2 yes
‘heapsort’ 3 O(n*log(n)) 0 no
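Note that newer NumPy versions (1.15 and later) also accept kind='stable', which states the intent directly:

(-foo).argsort(kind='stable')[:3]  # array([2, 0, 1])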
This is an extremely hacky answer, but why don't you just argsort the array in reverse? That way argsort picks the last index (in reverse), which is the first index.
This translates to:
>>> foo = np.array([3, 1, 4, 0, 1, 0])
>>> foo.argsort()[::-1]
array([2, 0, 4, 1, 5, 3])
>>> foo.size - 1 - foo[::-1].argsort()[::-1]
array([2, 0, 1, 4, 3, 5])
I have 3 numpy.ndarray vectors, X, Y and intensity. I would like to combine them into a single numpy array, then sort by the third column (or the first one). I tried the following code:
m=np.column_stack((X,Y))
m=np.column_stack((m,intensity))
m=np.sort(m,axis=2)
Then I got the error: ValueError: axis(=2) out of bounds.
When I print m, I get:
array([[ 109430, 285103, 121],
[ 134497, 284907, 134],
[ 160038, 285321, 132],
...,
[12374406, 2742429, 148],
[12371858, 2741994, 148],
[12372221, 2742017, 161]])
How can I fix this, that is, get an array sorted by one of its columns?
axis=2 does not refer to the column index but to the dimension of the array: numpy will try to look for a third dimension in the data and sort along it from smallest to largest. Sorting from smallest to largest in the first dimension (axis=0) means the values in each column go from smallest to largest. Sorting from smallest to largest in the second dimension (axis=1) means the values in each row go from smallest to largest. Examples are below.
Furthermore, sort would work differently depending on the base array. Two arrays are considered: Unstructured and structured.
Unstructured
X = np.random.randn(3)
Y = np.random.randn(3)
intensity = np.random.randn(3)
m=np.column_stack((X,Y))
m=np.column_stack((m,intensity))
m is treated as an unstructured array because no fields are linked to any of the columns. In other words, if you call np.sort() on m, it will just sort the values from smallest to largest, top to bottom if axis=0 and left to right if axis=1. The rows are not preserved.
Original:
[[ 1.20122251 1.41451461 -1.66427245]
[ 1.3657312 -0.2318793 -0.23870104]
[-0.30280613 0.79123814 -1.64082042]]
Axis=1:
[[-1.66427245 1.20122251 1.41451461]
[-0.23870104 -0.2318793 1.3657312 ]
[-1.64082042 -0.30280613 0.79123814]]
Axis = 0:
[[-0.30280613 -0.2318793 -1.66427245]
[ 1.20122251 0.79123814 -1.64082042]
[ 1.3657312 1.41451461 -0.23870104]]
Structured
As you can see, the row structure is not kept. If you would like to preserve the rows, you need to attach labels to the datatype and create a structured array from it. You can then sort by any column with order=label_name.
dtype = [("a",float),("b",float),("c",float)]
m = [tuple(x) for x in m]
labelled_arr = np.array(m,dtype)
print np.sort(labelled_arr,order="a")
This will get:
[(-0.30280612629541204, 0.7912381363389004, -1.640820419927318)
(1.2012225144719493, 1.4145146097431947, -1.6642724545574712)
(1.3657312047892836, -0.23187929505306418, -0.2387010374198555)]
Another, more convenient way of doing this is to pass the data into a pandas DataFrame, which automatically creates column names from 0 to n-1. You can then just call the sort_values method, passing in the column you want to sort by, followed by axis=0 if you would like it sorted from top to bottom, just as in numpy.
Example:
import pandas as pd

pd.DataFrame(m).sort_values(0, axis=0)
Output:
0 1 2
2 -0.302806 0.791238 -1.640820
0 1.201223 1.414515 -1.664272
1 1.365731 -0.231879 -0.238701
You are getting that error because you don't have an axis with index 2; axes are zero-indexed, so a 2-D array only has axes 0 and 1. Regardless, np.sort will sort every column, or every row. Consider from the docs:
order : str or list of str, optional
    When a is an array with fields defined, this argument specifies which fields to compare first, second, etc. A single field can be specified as a string, and not all fields need be specified, but unspecified fields will still be used, in the order in which they come up in the dtype, to break ties.
For example:
In [28]: a
Out[28]:
array([[0, 0, 1],
[1, 2, 3],
[3, 1, 8]])
In [29]: np.sort(a, axis = 0)
Out[29]:
array([[0, 0, 1],
[1, 1, 3],
[3, 2, 8]])
In [30]: np.sort(a, axis = 1)
Out[30]:
array([[0, 0, 1],
[1, 2, 3],
[1, 3, 8]])
So, I think what you really want is this neat little idiom:
In [32]: a[a[:,2].argsort()]
Out[32]:
array([[0, 0, 1],
[1, 2, 3],
[3, 1, 8]])
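The same idiom works for any column; sorting by the second column, for instance:

In [33]: a[a[:, 1].argsort()]
Out[33]:
array([[0, 0, 1],
       [3, 1, 8],
       [1, 2, 3]])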
The type of matrix I am dealing with was created from a vector as shown below:
Start with a 1-d vector V of length L.
To create a matrix A from V with N rows, make the i'th column of A the N consecutive entries of V starting from the i'th entry, so long as there are enough entries left in V to fill up the column. This means A has L - N + 1 columns.
Here is an example:
V = [0, 1, 2, 3, 4, 5]
N = 3
A =
[0 1 2 3
1 2 3 4
2 3 4 5]
Representing the matrix this way requires more memory than my machine has. Is there any reasonable way of storing this matrix sparsely? I am currently storing N * (L - N + 1) values, when I only need to store L values.
You can take a view of your original vector as follows:
>>> import numpy as np
>>> from numpy.lib.stride_tricks import as_strided
>>>
>>> v = np.array([0, 1, 2, 3, 4, 5])
>>> n = 3
>>>
>>> a = as_strided(v, shape=(n, len(v)-n+1), strides=v.strides*2)
>>> a
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 3, 4, 5]])
This is a view, not a copy of your original data, e.g.
>>> v[3] = 0
>>> v
array([0, 1, 2, 0, 4, 5])
>>> a
array([[0, 1, 2, 0],
[1, 2, 0, 4],
[2, 0, 4, 5]])
But you have to be careful not to do any operation on a that triggers a copy, since that would send your memory use through the ceiling.
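On NumPy 1.20 and later, numpy.lib.stride_tricks.sliding_window_view builds the same kind of view more safely, since it returns a read-only view and an accidental write raises instead of silently aliasing. Transposed, it matches the layout above (a fresh v is used here because the one above was modified):

>>> from numpy.lib.stride_tricks import sliding_window_view
>>> v = np.array([0, 1, 2, 3, 4, 5])
>>> sliding_window_view(v, 3).T
array([[0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5]])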
If you're already using numpy, use its strided or sparse arrays, as Jaime explained.
If you're not already using numpy, you may want to strongly consider using it.
If you need to stick with pure Python, there are three obvious ways to do this, depending on your use case.
For strided or sparse-but-clustered arrays, you could do effectively the same thing as numpy.
Or you could use a simple run-length-encoding scheme, plus maybe a higher-level list of runs, or a list of pointers to every Nth element, or even a whole stack of such lists (one for every 100 elements, one for every 10000, etc.).
But for mostly-uniformly-dense arrays, the easiest thing is to simply store a dict or defaultdict mapping indices to values. Random-access lookups and updates are still O(1), albeit with a higher constant factor, and the storage you waste storing (in effect) a hash, key, and value instead of just a value for each non-default element is more than made up for by not storing the default elements at all, as long as the density is below roughly 0.33.
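A minimal sketch of that dict-based approach, assuming a fixed logical length and a default fill value (the SparseVector name and interface are purely illustrative, not from any library). A plain dict with .get is used instead of a defaultdict, since defaultdict would insert the default on every read and slowly defeat the sparsity:

class SparseVector:
    """Store only non-default entries; unset indices read as the default."""

    def __init__(self, length, default=0):
        self.length = length
        self.default = default
        self.data = {}  # index -> value, only for non-default entries

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        return self.data.get(i, self.default)

    def __setitem__(self, i, value):
        if not 0 <= i < self.length:
            raise IndexError(i)
        self.data[i] = value

sv = SparseVector(1_000_000)
sv[123456] = 7
print(sv[123456], sv[0])  # 7 0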