Python: How to index the elements of a numpy array? - python

I'm looking for a function that would do what the function indices does in the following hypothetical code:
indices( numpy.array([[1, 2, 3], [2, 3, 4]]) )
{1: [(0,0)], 2: [(0,1),(1,0)], 3: [(0,2),(1,1)], 4: [(1,2)]}
Specifically, I want to produce a dictionary whose keys are the unique elements in the flattened array and whose values are lists of the full indices of the respective key.
I've looked at the where function, but it does not seem to provide an efficient way to solve this for large arrays. What's the best way to do this?
Notes: I'm using Python 2.7

Given that your desired output is a dictionary, I don't think there's going to be an efficient way to do this with NumPy operations. Your best bet will probably be something like
import collections
import itertools
d = collections.defaultdict(list)
for indices in itertools.product(*map(range, a.shape)):
d[a[indices]].append(indices)

the numpy_indexed package can perform these kind of grouping operations in an efficient and fully vectorized manner, ie:
import numpy_indexed as npi
a = np.array([[1, 2, 3], [2, 3, 4]])
keys, values = npi.group_by(a.reshape(-1), np.indices(a.shape).reshape(-1, a.ndim))

I don't know about numpy, but this is an example solution if just using arrays:
arrs = [[1, 2, 3], [2, 3, 4]]
dict = {}
for i in range(0, len(arrs)):
arr = arrs[i]
for j in range(0, len(arr)):
num = arr[j]
indices = dict.get(num)
if indices is None:
dict[num] = [(i, j)]
else:
dict[num].append((i, j))

Related

need to grab entire first element (array) in a multi-dimentional list of arrays python3

Apologies if this has already been asked, but I searched quite a bit and couldn't find quite the right solution. I'm new to python, but I'll try to be as clear as possible. In short, I have a list of arrays in the following format resulting from a joining a multiprocessing pool:
array = [[[1,2,3], 5, 47, 2515],..... [[4,5,6], 3, 35, 2096]]]
and I want to get all values from the first array element to form a new array in the following form:
print(new_array)
[1,2,3,4,5,6]
In my code, I was trying to get the first value through this function:
new_array = array[0][0]
but this only returns the first value as such:
print(new_array)
[1,2,3]
I also tried np.take after converting the array into a np array:
array = np.array(array)
new_array = np.take(results,0)
print(new_array)
[1,2,3]
I have tried a number of np functions (concatenate, take, etc.) to try and iterate this over the list, but get back the following error (presumably because the size of the array changes):
ValueError: autodetected range of [[], [1445.0, 1445.0, -248.0, 638.0, -108.0, 649.0]] is not finite
Thanks for any help!
You can achieve it without numpy using reduce:
from functools import reduce
l = [[[1,2,3], 5, 47, 2515], [[4,5,6], 3, 35, 2096]]
res = reduce(lambda a, b: [*a, *b], [x[0] for x in l])
Output
[1, 2, 3, 4, 5, 6]
Maybe it is worth mentioning that [*a, *b] is a way to concatenate lists in python, for example:
[*[1, 2, 3], *[4, 5, 6]] # [1, 2, 3, 4, 5, 6]
You could also use itertools' chain() function to flatten an extraction of the first subArray in each element of the list:
from itertools import chain
result = list(chain(*[sub[0] for sub in array]))

Get all the rows with same values in python?

So, suppose I have this 2D array in python
a = [[1,2]
[2,3]
[3,2]
[1,3]]
How do get all array entries with the same row value and store them in a new matrix.
For example, I will have
b = [1,2]
[1,3]
after the query.
My approach is b = [a[i] for i in a if a[i][0] == 1][0]]
but it didn't seem to work?
I am new to Python and the whole index slicing thing is kind confusing. Thanks!
Since you tagged numpy, you can perform this task with NumPy arrays. First define your array:
a = np.array([[1, 2],
[2, 3],
[3, 2],
[1, 3]])
For all unique values in the first column, you can use a dictionary comprehension. This is useful to avoid duplicating operations.
d = {i: a[a[:, 0] == i] for i in np.unique(a[:, 0])}
{1: array([[1, 2],
[1, 3]]),
2: array([[2, 3]]),
3: array([[3, 2]])}
Then access your array where first column is equal to 1 via d[1].
For a single query, you can simply use a[a[:, 0] == 1].
The for i in a syntax gives you the actual items in the list..so for example:
list_of_strs = ['first', 'second', 'third']
first_letters = [s[0] for s in list_of_strs]
# first_letters == ['f', 's', 't']
What you are actually doing with b = [a[i] for i in a if a[i][0]==1] is trying to index an element of a with each of the elements of a. But since each element of a is itself a list, this won't work (you can't index lists with other lists)
Something like this should work:
b = [row for row in a if row[0] == 1]
Bonus points if you write it as a function so that you can pick which thing you want to filter on.
If you're working with arrays a lot, you might also check out the numpy library. With numpy, you can do stuff like this.
import numpy as np
a = np.array([[1,2], [2,3], [3,2], [1,3]])
b = a[a[:,0] == 1]
The last line is basically indexing the original array a with a boolean array defined inside the first set of square brackets. It's very flexible, so you could also modify this to filter on the second element, filter on other conditions (like > some_number), etc. etc.

Efficiently find indices of all values in an array

I have a very large array, consisting of integers between 0 and N, where each value occurs at least once.
I'd like to know, for each value k, all the indices in my array where the array's value equals k.
For example:
arr = np.array([0,1,2,3,2,1,0])
desired_output = {
0: np.array([0,6]),
1: np.array([1,5]),
2: np.array([2,4]),
3: np.array([3]),
}
Right now I am accomplishing this with a loop over range(N+1), and calling np.where N times.
indices = {}
for value in range(max(arr)+1):
indices[value] = np.where(arr == value)[0]
This loop is by far the slowest part of my code. (Both the arr==value evaluation and the np.where call take up significant chunks of time.) Is there a more efficient way to do this?
I also tried playing around with np.unique(arr, return_index=True) but that only tells me the very first index, rather than all of them.
Approach #1
Here's a vectorized approach to get those indices as list of arrays -
sidx = arr.argsort()
unq, cut_idx = np.unique(arr[sidx],return_index=True)
indices = np.split(sidx,cut_idx)[1:]
If you want the final dictionary that corresponds each unique element to their indices, finally we can use a loop-comprehension -
dict_out = {unq[i]:iterID for i,iterID in enumerate(indices)}
Approach #2
If you are just interested in the list of arrays, here's an alternative meant for performance -
sidx = arr.argsort()
indices = np.split(sidx,np.flatnonzero(np.diff(arr[sidx])>0)+1)
A pythonic way is using collections.defaultdict():
>>> from collections import defaultdict
>>>
>>> d = defaultdict(list)
>>>
>>> for i, j in enumerate(arr):
... d[j].append(i)
...
>>> d
defaultdict(<type 'list'>, {0: [0, 6], 1: [1, 5], 2: [2, 4], 3: [3]})
And here is a Numpythonic way using a dictionary comprehension and numpy.where():
>>> {i: np.where(arr == i)[0] for i in np.unique(arr)}
{0: array([0, 6]), 1: array([1, 5]), 2: array([2, 4]), 3: array([3])}
And here is a pure Numpythonic approach if you don't want to involve the dictionary:
>>> uniq = np.unique(arr)
>>> args, indices = np.where((np.tile(arr, len(uniq)).reshape(len(uniq), len(arr)) == np.vstack(uniq)))
>>> np.split(indices, np.where(np.diff(args))[0] + 1)
[array([0, 6]), array([1, 5]), array([2, 4]), array([3])]
I don't know numpy but you could definitely do this in one iteration, with a defaultdict:
indices = defaultdict(list)
for i, val in enumerate(arr):
indices[val].append(i)
Fully vectorized solution using the numpy_indexed package:
import numpy_indexed as npi
k, idx = npi.groupy_by(arr, np.arange(len(arr)))
On a higher level; why do you need these indices? Subsequent grouped-operations can usually be computed much more efficiently using the group_by functionality [eg, npi.group_by(arr).mean(someotherarray)], without explicitly computing the indices of the keys.

Replace values in a large list of arrays (performance)

I have a performance problem with replacing values of a list of arrays using a dictionary.
Let's say this is my dictionary:
# Create a sample dictionary
keys = [1, 2, 3, 4]
values = [5, 6, 7, 8]
dictionary = dict(zip(keys, values))
And this is my list of arrays:
# import numpy as np
# List of arrays
listvalues = []
arr1 = np.array([1, 3, 2])
arr2 = np.array([1, 1, 2, 4])
arr3 = np.array([4, 3, 2])
listvalues.append(arr1)
listvalues.append(arr2)
listvalues.append(arr3)
listvalues
>[array([1, 3, 2]), array([1, 1, 2, 4]), array([4, 3, 2])]
I then use the following function to replace all values in a nD numpy array using a dictionary:
# Replace function
def replace(arr, rep_dict):
rep_keys, rep_vals = np.array(list(zip(*sorted(rep_dict.items()))))
idces = np.digitize(arr, rep_keys, right=True)
return rep_vals[idces]
This function is really fast, however I need to iterate over my list of arrays to apply this function to each array:
replaced = []
for i in xrange(len(listvalues)):
replaced.append(replace(listvalues[i], dictionary))
This is the bottleneck of the process, as it needs to iterate over thousands of arrays.
How could I do achieve the same result without using the for-loop? It is important that the result is in the same format as the input (a list of arrays with replaced values)
Many thanks guys!!
This will do the trick efficiently, using the numpy_indexed package. It can be further simplified if all values in 'listvalues' are guaranteed to be present in 'keys'; but ill leave that as an exercise to the reader.
import numpy_indexed as npi
arr = np.concatenate(listvalues)
idx = npi.indices(keys, arr, missing='mask')
remap = np.logical_not(idx.mask)
arr[remap] = np.array(values)[idx[remap]]
replaced = np.array_split(arr, np.cumsum([len(a) for a in listvalues][:-1]))

How to get indices of a sorted array in Python

I have a numerical list:
myList = [1, 2, 3, 100, 5]
Now if I sort this list to obtain [1, 2, 3, 5, 100].
What I want is the indices of the elements from the
original list in the sorted order i.e. [0, 1, 2, 4, 3]
--- ala MATLAB's sort function that returns both
values and indices.
If you are using numpy, you have the argsort() function available:
>>> import numpy
>>> numpy.argsort(myList)
array([0, 1, 2, 4, 3])
http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
This returns the arguments that would sort the array or list.
Something like next:
>>> myList = [1, 2, 3, 100, 5]
>>> [i[0] for i in sorted(enumerate(myList), key=lambda x:x[1])]
[0, 1, 2, 4, 3]
enumerate(myList) gives you a list containing tuples of (index, value):
[(0, 1), (1, 2), (2, 3), (3, 100), (4, 5)]
You sort the list by passing it to sorted and specifying a function to extract the sort key (the second element of each tuple; that's what the lambda is for. Finally, the original index of each sorted element is extracted using the [i[0] for i in ...] list comprehension.
myList = [1, 2, 3, 100, 5]
sorted(range(len(myList)),key=myList.__getitem__)
[0, 1, 2, 4, 3]
I did a quick performance check on these with perfplot (a project of mine) and found that it's hard to recommend anything else but
np.argsort(x)
(note the log scale):
Code to reproduce the plot:
import perfplot
import numpy as np
def sorted_enumerate(seq):
return [i for (v, i) in sorted((v, i) for (i, v) in enumerate(seq))]
def sorted_enumerate_key(seq):
return [x for x, y in sorted(enumerate(seq), key=lambda x: x[1])]
def sorted_range(seq):
return sorted(range(len(seq)), key=seq.__getitem__)
b = perfplot.bench(
setup=np.random.rand,
kernels=[sorted_enumerate, sorted_enumerate_key, sorted_range, np.argsort],
n_range=[2 ** k for k in range(15)],
xlabel="len(x)",
)
b.save("out.png")
The answers with enumerate are nice, but I personally don't like the lambda used to sort by the value. The following just reverses the index and the value, and sorts that. So it'll first sort by value, then by index.
sorted((e,i) for i,e in enumerate(myList))
Updated answer with enumerate and itemgetter:
sorted(enumerate(a), key=lambda x: x[1])
# [(0, 1), (1, 2), (2, 3), (4, 5), (3, 100)]
Zip the lists together: The first element in the tuple will the index, the second is the value (then sort it using the second value of the tuple x[1], x is the tuple)
Or using itemgetter from the operatormodule`:
from operator import itemgetter
sorted(enumerate(a), key=itemgetter(1))
Essentially you need to do an argsort, what implementation you need depends if you want to use external libraries (e.g. NumPy) or if you want to stay pure-Python without dependencies.
The question you need to ask yourself is: Do you want the
indices that would sort the array/list
indices that the elements would have in the sorted array/list
Unfortunately the example in the question doesn't make it clear what is desired because both will give the same result:
>>> arr = np.array([1, 2, 3, 100, 5])
>>> np.argsort(np.argsort(arr))
array([0, 1, 2, 4, 3], dtype=int64)
>>> np.argsort(arr)
array([0, 1, 2, 4, 3], dtype=int64)
Choosing the argsort implementation
If you have NumPy at your disposal you can simply use the function numpy.argsort or method numpy.ndarray.argsort.
An implementation without NumPy was mentioned in some other answers already, so I'll just recap the fastest solution according to the benchmark answer here
def argsort(l):
return sorted(range(len(l)), key=l.__getitem__)
Getting the indices that would sort the array/list
To get the indices that would sort the array/list you can simply call argsort on the array or list. I'm using the NumPy versions here but the Python implementation should give the same results
>>> arr = np.array([3, 1, 2, 4])
>>> np.argsort(arr)
array([1, 2, 0, 3], dtype=int64)
The result contains the indices that are needed to get the sorted array.
Since the sorted array would be [1, 2, 3, 4] the argsorted array contains the indices of these elements in the original.
The smallest value is 1 and it is at index 1 in the original so the first element of the result is 1.
The 2 is at index 2 in the original so the second element of the result is 2.
The 3 is at index 0 in the original so the third element of the result is 0.
The largest value 4 and it is at index 3 in the original so the last element of the result is 3.
Getting the indices that the elements would have in the sorted array/list
In this case you would need to apply argsort twice:
>>> arr = np.array([3, 1, 2, 4])
>>> np.argsort(np.argsort(arr))
array([2, 0, 1, 3], dtype=int64)
In this case :
the first element of the original is 3, which is the third largest value so it would have index 2 in the sorted array/list so the first element is 2.
the second element of the original is 1, which is the smallest value so it would have index 0 in the sorted array/list so the second element is 0.
the third element of the original is 2, which is the second-smallest value so it would have index 1 in the sorted array/list so the third element is 1.
the fourth element of the original is 4 which is the largest value so it would have index 3 in the sorted array/list so the last element is 3.
If you do not want to use numpy,
sorted(range(len(seq)), key=seq.__getitem__)
is fastest, as demonstrated here.
The other answers are WRONG.
Running argsort once is not the solution.
For example, the following code:
import numpy as np
x = [3,1,2]
np.argsort(x)
yields array([1, 2, 0], dtype=int64) which is not what we want.
The answer should be to run argsort twice:
import numpy as np
x = [3,1,2]
np.argsort(np.argsort(x))
gives array([2, 0, 1], dtype=int64) as expected.
Most easiest way you can use Numpy Packages for that purpose:
import numpy
s = numpy.array([2, 3, 1, 4, 5])
sort_index = numpy.argsort(s)
print(sort_index)
But If you want that you code should use baisc python code:
s = [2, 3, 1, 4, 5]
li=[]
for i in range(len(s)):
li.append([s[i],i])
li.sort()
sort_index = []
for x in li:
sort_index.append(x[1])
print(sort_index)
We will create another array of indexes from 0 to n-1
Then zip this to the original array and then sort it on the basis of the original values
ar = [1,2,3,4,5]
new_ar = list(zip(ar,[i for i in range(len(ar))]))
new_ar.sort()
`
s = [2, 3, 1, 4, 5]
print([sorted(s, reverse=False).index(val) for val in s])
For a list with duplicate elements, it will return the rank without ties, e.g.
s = [2, 2, 1, 4, 5]
print([sorted(s, reverse=False).index(val) for val in s])
returns
[1, 1, 0, 3, 4]
Import numpy as np
FOR INDEX
S=[11,2,44,55,66,0,10,3,33]
r=np.argsort(S)
[output]=array([5, 1, 7, 6, 0, 8, 2, 3, 4])
argsort Returns the indices of S in sorted order
FOR VALUE
np.sort(S)
[output]=array([ 0, 2, 3, 10, 11, 33, 44, 55, 66])
Code:
s = [2, 3, 1, 4, 5]
li = []
for i in range(len(s)):
li.append([s[i], i])
li.sort()
sort_index = []
for x in li:
sort_index.append(x[1])
print(sort_index)
Try this, It worked for me cheers!
firstly convert your list to this:
myList = [1, 2, 3, 100, 5]
add a index to your list's item
myList = [[0, 1], [1, 2], [2, 3], [3, 100], [4, 5]]
next :
sorted(myList, key=lambda k:k[1])
result:
[[0, 1], [1, 2], [2, 3], [4, 5], [3, 100]]
A variant on RustyRob's answer (which is already the most performant pure Python solution) that may be superior when the collection you're sorting either:
Isn't a sequence (e.g. it's a set, and there's a legitimate reason to want the indices corresponding to how far an iterator must be advanced to reach the item), or
Is a sequence without O(1) indexing (among Python's included batteries, collections.deque is a notable example of this)
Case #1 is unlikely to be useful, but case #2 is more likely to be meaningful. In either case, you have two choices:
Convert to a list/tuple and use the converted version, or
Use a trick to assign keys based on iteration order
This answer provides the solution to #2. Note that it's not guaranteed to work by the language standard; the language says each key will be computed once, but not the order they will be computed in. On every version of CPython, the reference interpreter, to date, it's precomputed in order from beginning to end, so this works, but be aware it's not guaranteed. In any event, the code is:
sizediterable = ...
sorted_indices = sorted(range(len(sizediterable)), key=lambda _, it=iter(sizediterable): next(it))
All that does is provide a key function that ignores the value it's given (an index) and instead provides the next item from an iterator preconstructed from the original container (cached as a defaulted argument to allow it to function as a one-liner). As a result, for something like a large collections.deque, where using its .__getitem__ involves O(n) work (and therefore computing all the keys would involve O(n²) work), sequential iteration remains O(1), so generating the keys remains just O(n).
If you need something guaranteed to work by the language standard, using built-in types, Roman's solution will have the same algorithmic efficiency as this solution (as neither of them rely on the algorithmic efficiency of indexing the original container).
To be clear, for the suggested use case with collections.deque, the deque would have to be quite large for this to matter; deques have a fairly large constant divisor for indexing, so only truly huge ones would have an issue. Of course, by the same token, the cost of sorting is pretty minimal if the inputs are small/cheap to compare, so if your inputs are large enough that efficient sorting matters, they're large enough for efficient indexing to matter too.

Categories