Grouping repetitions in an array? [duplicate] - python

This question already has an answer here: What's the most Pythonic way to identify consecutive duplicates in a list? (1 answer). Closed 9 years ago.
I am looking for a function that takes a one-dimensional sorted array and returns
a two-dimensional array with two columns: the first column containing the non-repeated
items and the second column containing the number of repetitions of each item. Right now
my code is as follows:
def priorsGrouper(priors):
    if priors.size==0:
        ret=priors;
    elif priors.size==1:
        ret=priors[0],1;
    else:
        ret=numpy.zeros((1,2));
        pointer1,pointer2=0,0;
        while(pointer1<priors.size):
            counter=0;
            while(pointer2<priors.size and priors[pointer2]==priors[pointer1]):
                counter+=1;
                pointer2+=1;
            ret=numpy.row_stack((ret,[priors[pointer1],pointer2-pointer1]))
            pointer1=pointer2;
    return ret;
print priorsGrouper(numpy.array([1,2,2,3]))
My output is as follows:
[[ 0.  0.]
 [ 1.  1.]
 [ 2.  2.]
 [ 3.  1.]]
First of all, I cannot get rid of the initial [0, 0] row. Secondly, I want to know if there is
a numpy or scipy function for this, or whether mine is OK.
Thanks.

You could use np.unique to get the unique values in x, as well as an array of indices (called inverse). The inverse can be thought of as "labels" for the elements in x. Unlike x itself, the labels are always integers, starting at 0.
Then you can take a bincount of the labels. Since the labels start at 0, the bincount won't be filled with a lot of zeros that you don't care about.
Finally, column_stack will join y and the bincount into a 2D array:
In [84]: x = np.array([1,2,2,3])
In [85]: y, inverse = np.unique(x, return_inverse=True)
In [86]: y
Out[86]: array([1, 2, 3])
In [87]: inverse
Out[87]: array([0, 1, 1, 2])
In [88]: np.bincount(inverse)
Out[88]: array([1, 2, 1])
In [89]: np.column_stack((y,np.bincount(inverse)))
Out[89]:
array([[1, 1],
       [2, 2],
       [3, 1]])
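As an aside, on NumPy 1.9 and later np.unique can return the counts directly via return_counts, which skips the bincount step entirely. A minimal sketch, using the same x as above:

y, counts = np.unique(x, return_counts=True)  # NumPy >= 1.9
np.column_stack((y, counts))
# array([[1, 1],
#        [2, 2],
#        [3, 1]])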
Sometimes when an array is small, plain Python methods turn out to be faster than NumPy functions. I wanted to check if that was the case here and, if so, how large x would have to be before the NumPy method is faster.
Here is how the performance of the various methods compares as a function of the size of x (graph not reproduced here); timings for len(x) == 1000 follow:
In [173]: x = np.random.random(1000)
In [174]: x.sort()
In [156]: %timeit using_unique(x)
10000 loops, best of 3: 99.7 us per loop
In [180]: %timeit using_groupby(x)
100 loops, best of 3: 3.64 ms per loop
In [157]: %timeit using_counter(x)
100 loops, best of 3: 4.31 ms per loop
In [158]: %timeit using_ordered_dict(x)
100 loops, best of 3: 4.7 ms per loop
For len(x) of 1000, using_unique is over 35x faster than any of the plain Python methods tested.
So it looks like using_unique is fastest, even for very small len(x).
Here is the program used to generate the graph:
import numpy as np
import collections
import itertools as IT
import matplotlib.pyplot as plt
import timeit

def using_unique(x):
    y, inverse = np.unique(x, return_inverse=True)
    return np.column_stack((y, np.bincount(inverse)))

def using_counter(x):
    result = collections.Counter(x)
    return np.array(sorted(result.items()))

def using_ordered_dict(x):
    result = collections.OrderedDict()
    for item in x:
        result[item] = result.get(item,0)+1
    return np.array(result.items())

def using_groupby(x):
    return np.array([(k, sum(1 for i in g)) for k, g in IT.groupby(x)])

fig, ax = plt.subplots()
timing = collections.defaultdict(list)
Ns = [int(round(n)) for n in np.logspace(0, 3, 10)]
for n in Ns:
    x = np.random.random(n)
    x.sort()
    timing['unique'].append(
        timeit.timeit('m.using_unique(m.x)', 'import __main__ as m', number=1000))
    timing['counter'].append(
        timeit.timeit('m.using_counter(m.x)', 'import __main__ as m', number=1000))
    timing['ordered_dict'].append(
        timeit.timeit('m.using_ordered_dict(m.x)', 'import __main__ as m', number=1000))
    timing['groupby'].append(
        timeit.timeit('m.using_groupby(m.x)', 'import __main__ as m', number=1000))

ax.plot(Ns, timing['unique'], label='using_unique')
ax.plot(Ns, timing['counter'], label='using_counter')
ax.plot(Ns, timing['ordered_dict'], label='using_ordered_dict')
ax.plot(Ns, timing['groupby'], label='using_groupby')
plt.legend(loc='best')
plt.ylabel('milliseconds')
plt.xlabel('size of x')
plt.show()

If order is not important, use Counter.
from collections import Counter

>>> Counter([1,2,2,3])
Counter({2: 2, 1: 1, 3: 1})
>>> Counter([1,2,2,3]).items()
[(1, 1), (2, 2), (3, 1)]
To preserve order (by first appearance), you can implement your own version of Counter:
from collections import OrderedDict

def OrderedCounter(seq):
    res = OrderedDict()
    for x in seq:
        res.setdefault(x, 0)
        res[x] += 1
    return res

>>> OrderedCounter([1,2,2,3])
OrderedDict([(1, 1), (2, 2), (3, 1)])
>>> OrderedCounter([1,2,2,3]).items()
[(1, 1), (2, 2), (3, 1)]
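As a side note, on Python 3.7+ plain dicts (and therefore Counter, which subclasses dict) preserve insertion order, so on modern Python a plain Counter already keeps first-appearance order and the OrderedCounter shim above is only needed on older versions:

>>> from collections import Counter
>>> list(Counter([1,2,2,3]).items())  # Python 3.7+: first-appearance order
[(1, 1), (2, 2), (3, 1)]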

If you want to count repetitions of an item you can use a dictionary:
l = [1, 2, 2, 3]
d = {}
for i in l:
    if i not in d:
        d[i] = 1
    else:
        d[i] += 1
result = [[k, v] for k, v in d.items()]
For your example this returns:
[[1, 1],
 [2, 2],
 [3, 1]]
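As an aside, the if/else bookkeeping can be collapsed with dict.get, which supplies a default for unseen keys:

l = [1, 2, 2, 3]
d = {}
for i in l:
    d[i] = d.get(i, 0) + 1  # 0 if i has not been seen yet
# d -> {1: 1, 2: 2, 3: 1}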
Good luck.

First of all, you don't need to end your statements with semicolons (;); this isn't C. :-)
Second, line 5 (and others) sets ret to be value,value, but that isn't a list:
>type foo.py
def foo():
    return [1],2

a,b = foo()
print "a = {0}".format(a)
print "b = {0}".format(b)
Gives:
>python foo.py
a = [1]
b = 2
Third: there are easier ways to do this, here's what comes to mind:
Use the set constructor to create a unique collection of the items
Create a list of the number of times each entry in the set occurs in the input list
Use zip() to combine and return the two lists as a set of tuples (although this isn't exactly what you were asking for)
Here's one way:
def priorsGrouper(priors):
    """Find out how many times each element occurs in a list.
    @param[in] priors List of elements
    @return Two-dimensional list: first row is the unique elements,
            second row is the number of occurrences of each element.
    """
    # Generate a `set' containing only unique elements from the input
    mySet = set(priors)
    # Create the list that will store the number of occurrences
    occurrenceCounts = []
    # Count how many times each element occurs on the input:
    for element in mySet:
        occurrenceCounts.append(priors.count(element))
    # Combine the two:
    combinedArray = zip(mySet, occurrenceCounts)
    return combinedArray
    # End of priorsGrouper() ----------------------------------------------
# Check zero-element case
print priorsGrouper([])
# Check multi-element case
sampleInput = ['a','a', 'b', 'c', 'c', 'c']
print priorsGrouper(sampleInput)

Related

What is the alternative for numpy bincount when using negative integers? [duplicate]

Suppose I have the following NumPy array:
a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
How can I find the most frequent number in this array?
If your list contains all non-negative ints, you should take a look at numpy.bincount:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html
and then probably use np.argmax:
a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
counts = np.bincount(a)
print(np.argmax(counts))
For a more complicated list (that perhaps contains negative numbers or non-integer values), you can use np.histogram in a similar way. Alternatively, if you just want to work in python without using numpy, collections.Counter is a good way of handling this sort of data.
from collections import Counter
a = [1,2,3,1,2,1,1,1,3,2,2,1]
b = Counter(a)
print(b.most_common(1))
You may use
values, counts = np.unique(a, return_counts=True)
ind = np.argmax(counts)
print(values[ind]) # prints the most frequent element
ind = np.argpartition(-counts, kth=10)[:10]
print(values[ind]) # prints the 10 most frequent elements
If some element is as frequent as another one, this code will return only the first element.
If you're willing to use SciPy:
>>> from scipy.stats import mode
>>> mode([1,2,3,1,2,1,1,1,3,2,2,1])
(array([ 1.]), array([ 6.]))
>>> most_frequent = mode([1,2,3,1,2,1,1,1,3,2,2,1])[0][0]
>>> most_frequent
1.0
Performance (measured with IPython) of some solutions found here:
>>> # small array
>>> a = [12,3,65,33,12,3,123,888000]
>>>
>>> import collections
>>> collections.Counter(a).most_common()[0][0]
3
>>> %timeit collections.Counter(a).most_common()[0][0]
100000 loops, best of 3: 11.3 µs per loop
>>>
>>> import numpy
>>> numpy.bincount(a).argmax()
3
>>> %timeit numpy.bincount(a).argmax()
100 loops, best of 3: 2.84 ms per loop
>>>
>>> import scipy.stats
>>> scipy.stats.mode(a)[0][0]
3.0
>>> %timeit scipy.stats.mode(a)[0][0]
10000 loops, best of 3: 172 µs per loop
>>>
>>> from collections import defaultdict
>>> def jjc(l):
... d = defaultdict(int)
... for i in a:
... d[i] += 1
... return sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
...
>>> jjc(a)[0]
3
>>> %timeit jjc(a)[0]
100000 loops, best of 3: 5.58 µs per loop
>>>
>>> max(map(lambda val: (a.count(val), val), set(a)))[1]
12
>>> %timeit max(map(lambda val: (a.count(val), val), set(a)))[1]
100000 loops, best of 3: 4.11 µs per loop
>>>
For small arrays like the one in the problem, 'max' with 'set' is best.
According to @David Sanders, if you increase the array size to something like 100,000 elements, the "max w/set" algorithm ends up being the worst by far, whereas the "numpy bincount" method is the best.
Starting in Python 3.4, the standard library includes the statistics.mode function to return the single most common data point.
from statistics import mode
mode([1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1])
# 1
If there are multiple modes with the same frequency, statistics.mode returns the first one encountered.
Starting in Python 3.8, the statistics.multimode function returns a list of the most frequently occurring values in the order they were first encountered:
from statistics import multimode
multimode([1, 2, 3, 1, 2])
# [1, 2]
Also, if you want to get the most frequent value (positive or negative) without loading any modules, you can use the following code:
lVals = [1,2,3,1,2,1,1,1,3,2,2,1]
print max(map(lambda val: (lVals.count(val), val), set(lVals)))
While most of the answers above are useful, in case you:
1) need it to support non-positive-integer values (e.g. floats or negative integers ;-)), and
2) aren't on Python 2.7 (which collections.Counter requires), and
3) prefer not to add the dependency of scipy (or even numpy) to your code,
then a purely Python 2.6 solution that is O(n log n) (i.e., efficient) is just this:
from collections import defaultdict
a = [1,2,3,1,2,1,1,1,3,2,2,1]
d = defaultdict(int)
for i in a:
d[i] += 1
most_frequent = sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
In Python 3 the following should work:
max(set(a), key=lambda x: a.count(x))
I like the solution by JoshAdel.
But there is just one catch: the np.bincount() solution only works on numbers.
If you have strings, the collections.Counter solution will work for you.
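For example, a quick sketch with strings:

>>> from collections import Counter
>>> Counter(['a', 'b', 'a', 'c', 'a']).most_common(1)
[('a', 3)]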
Here is a general solution that may be applied along an axis, regardless of values, using purely numpy. I've also found that this is much faster than scipy.stats.mode if there are a lot of unique values.
import numpy
def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except:
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

    # If array is 1-D and numpy version is > 1.9 numpy.unique will suffice
    if all([ndim == 1,
            int(numpy.__version__.split('.')[0]) >= 1,
            int(numpy.__version__.split('.')[1]) >= 9]):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]

    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts
    counts = counts.reshape(shape).transpose(transpose)[slices] + 1
    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    index = numpy.ogrid[slices]
    index.insert(axis, numpy.argmax(counts, axis=axis))
    return sort[index], counts[index]
Expanding on this method: when finding the mode of the data, you may also want the index into the actual array, for example to see how far the value is from the center of the distribution.
(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]
Remember to discard the mode when more than one value shares the maximum count, since np.argmax reports only the first one.
You can use the following approach:
x = np.array([[2, 5, 5, 2], [2, 7, 8, 5], [2, 5, 7, 9]])
u, c = np.unique(x, return_counts=True)
print(u[c == np.amax(c)])
This will give the answer: array([2, 5])
Using np.bincount and the np.argmax method can get the most common value in a numpy array. If your array is an image array, use the np.ravel() or ndarray.flatten() methods to convert it to a 1-dimensional array first.
I was recently doing a project using collections.Counter (which tortured me).
In my opinion, Counter has very, very bad performance. It's just a class wrapping dict().
What's worse, if you use cProfile to profile its methods, you will see a lot of '__missing__' and '__instancecheck__' calls wasting the whole time.
Be careful using its most_common(), because every call invokes a sort, which makes it extremely slow. And if you use most_common(x), it invokes a heap sort, which is also slow.
By the way, numpy's bincount also has a problem: if you use np.bincount([1,2,4000000]), you will get an array with 4,000,000 elements.
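A way around that last problem is np.unique with return_counts, which allocates one entry per distinct value rather than one per possible value; a minimal sketch (the array values are illustrative):

import numpy as np

a = np.array([1, 2, 2, 4000000])
values, counts = np.unique(a, return_counts=True)
print(values[counts.argmax()])  # 2 -- no multi-million-element count array is built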

How to tell when more than one index matches?

I have an array of values that are often the same, and I am trying to find the index of the smallest one. But I also want to know all of the entries that match it.
So for example I have the array a = [1, 2, 3, 4] and to find the index of the smallest one I use a.index(min(a)) and this returns 0. But if I had an array of a = [1, 1, 1, 1], using the same thing would still return 0.
I want to know that multiple indices match what I am searching for and what those indices are. How would I go about doing this?
list.index(value) returns the index of the first occurrence of value in list.
A better idea is to use a simple list comprehension and enumerate:
indices = [i for i, x in enumerate(iterable) if x == v]
where v is the value you want to search for and iterable is an object that supports iterator protocol e.g. it can be a generator or a sequence (like list).
For your specific use case, that'll look like
def smallest(seq):
    m = min(seq)
    return [i for i, x in enumerate(seq) if x == m]
Some examples:
In [23]: smallest([1, 2, 3, 4])
Out[23]: [0]
In [24]: smallest([1, 1, 1, 1])
Out[24]: [0, 1, 2, 3]
If you're not sure whether the seq is empty or not, you can pass the default=-1 (or some other value) argument to min function (in Python 3.4+):
m = min(seq, default=-1)
Consider using m = min(seq or (-1,)) (again, any value) instead, when using older Python.
A different approach using numpy.where could look like
In [1]: import numpy as np
In [2]: def np_smallest(seq):
   ...:     return np.where(seq==seq.min())[0]
In [3]: np_smallest(np.array([1,1,1,1]))
Out[3]: array([0, 1, 2, 3])
In [4]: np_smallest(np.array([1,2,3,4]))
Out[4]: array([0])
This approach is slightly less efficient than the list comprehension for small lists, but if you face large arrays, numpy may save you some time.
In [5]: seq = np.random.randint(100, size=1000)
In [6]: %timeit np_smallest(seq)
100000 loops, best of 3: 10.1 µs per loop
In [7]: %timeit smallest(seq)
1000 loops, best of 3: 194 µs per loop
Here is my solution:
def all_smallest(seq):
    """Takes sequence, returns list of all smallest elements"""
    min_i = min(seq)
    amount = seq.count(min_i)
    ans = []
    if amount > 1:
        for n, i in enumerate(seq):
            if i == min_i:
                ans.append(n)
            if len(ans) == amount:
                return ans
    return [seq.index(min_i)]
The code is quite straightforward; I think it is clear without further explanation.

Manipulating data from python Numpy array: Using values from one column to sum over adjacent value

Here is what my data looks like:
a = np.array([[1,2],[2,1],[7,1],[3,2]])
I want a sum for each distinct number in the second column. In the example, there are two possible values in the second column: 1 and 2.
I want to sum all values in the first column that have the same value in the second column.
Is there an inbuilt numpy function for this?
For example, the sum for the value 1 in the second column would be: 2 + 7 = 9.
A short but slightly dodgy way is through the numpy function bincount:
np.bincount(a[:,1], weights=a[:,0])
What it does is count the number of occurrences of 0, 1, 2, etc. in the array (in this case a[:,1], the list of your category numbers). The weights argument then multiplies each count by a weight, here the first value in each row, essentially turning the count into a sum.
What it returns is this:
array([ 0.,  9.,  4.])
where 0 is the sum of the first elements where the second element is 0, and so on. So it will only work if the second numbers, by which you group, are integers.
If they are not consecutive integers starting from 0, you can select those you need by doing:
np.bincount(a[:,1], weights=a[:,0])[np.unique(a[:,1])]
This will return
array([9., 4.])
which is an array of sums, sorted by the second element (because unique returns a sorted list).
If your second elements are not integers, first off you are in some kind of trouble because of floating point arithmetic (elements which you think are equal could be different in reality). However, if you are sure it is fine, you can sort them and assign integers to them (using scipy's rankdata function, for example):
from scipy.stats import rankdata as rd

ind = rd(a[:,1], method='dense').astype(int) - 1  # ranking begins from 1, we need it from 0
sums = np.bincount(ind, weights=a[:,0])
This will return array([9., 4.]), in order sorted by your second element. You can zip them to pair sums with appropriate elements:
zip(np.unique(a[:,1]), sums)
Contents of play.py
import numpy as np

def compute_sum1(a):
    unique = np.unique(a[:, 1])
    same_idxs = ((u, np.argwhere(a[:, 1] == u)) for u in unique)
    # First coordinate of tuple contains value of col 2
    # Second coordinate contains the sum of entries from col 1
    same_sum = [(u, np.sum(a[idx, 0])) for u, idx in same_idxs]
    return same_sum

def compute_sum2(a):
    """A minimal implementation of compute_sum"""
    unique = np.unique(a[:, 1])
    same_idxs = (np.argwhere(a[:, 1] == u) for u in unique)
    same_sum = (np.sum(a[idx, 0]) for idx in same_idxs)
    return same_sum

def compute_sum3(a):
    unique = np.unique(a[:, 1])
    same_idxs = [np.argwhere(a[:, 1] == u) for u in unique]
    same_sum = np.sum(a[same_idxs, 0].squeeze(), 1)
    return same_sum

def main():
    a = np.array([[1,2],[2,1],[7,1],[3,2]]).astype("float")
    print("compute_sum1")
    print(compute_sum1(a))
    print("compute_sum3")
    print(compute_sum3(a))
    print("compute_sum2")
    same_sum = [s for s in compute_sum2(a)]
    print(same_sum)

if __name__ == '__main__':
    main()
Output:
In [59]: play.main()
compute_sum1
[(1.0, 9.0), (2.0, 4.0)]
compute_sum3
[ 9. 4.]
compute_sum2
[9.0, 4.0]
In [60]: %timeit play.compute_sum1(a)
10000 loops, best of 3: 95 µs per loop
In [61]: %timeit play.compute_sum2(a)
100000 loops, best of 3: 14.1 µs per loop
In [62]: %timeit play.compute_sum3(a)
10000 loops, best of 3: 77.4 µs per loop
Note that compute_sum2() is the fastest.
If your matrix is huge, I suggest using this implementation, as it uses generator comprehensions instead of list comprehensions, which is more memory efficient.
Similarly, same_sum in compute_sum1() can be converted to a generator comprehension by replacing [] with ().
You might want to have a look at this library: https://github.com/ml31415/accumarray . It's a clone of Matlab's accumarray for numpy.
a = np.array([[1,2],[2,1],[7,1],[3,2]])
accum(a[:,1], a[:,0])
>>> array([0, 9, 4])
The first 0 means that there were no fields with 0 in the index column.
The most straightforward way I see is through a list comprehension:
s = [[sum(x[0] for x in a if x[1] == y), y] for y in set([q[1] for q in a])]
However, if the second number in your lists represents some kind of a category, I suggest you convert your data into a dictionary.
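A minimal sketch of that dictionary approach, assuming a is the list of [value, category] rows from the question (shown here as plain Python lists):

from collections import defaultdict

a = [[1, 2], [2, 1], [7, 1], [3, 2]]
sums = defaultdict(int)
for value, category in a:
    sums[category] += value  # accumulate col-1 values per col-2 key
# dict(sums) -> {2: 4, 1: 9}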
As far as I know, there is no function to do this in numpy, but this can easily be done with pandas.DataFrame.groupby.
In [7]: import pandas as pd
In [8]: import numpy as np
In [9]: a = np.array([[1,2],[2,1],[7,1],[3,2]])
In [10]: df = pd.DataFrame(a)
In [11]: df.groupby(1)[0].sum()
Out[11]:
1
1 9
2 4
Name: 0, dtype: int64
Of course, you could do the same thing with itertools.groupby
In [1]: import numpy as np
...: from itertools import groupby
...: from operator import itemgetter
...:
In [3]: a = np.array([[1,2],[2,1],[7,1],[3,2]])
In [4]: sa = sorted(a.tolist(), key=itemgetter(1))
In [5]: grouper = groupby(sa, key=itemgetter(1))
In [6]: sums = {idx : sum(row[0] for row in group) for idx, group in grouper}
In [7]: sums
Out[7]: {1: 9, 2: 4}

Python: Finding differences between elements of a list

Given a list of numbers, how does one find the difference between each i-th element and the (i+1)-th?
Is it better to use a lambda expression or maybe a list comprehension?
For example:
Given a list t=[1,3,6,...], the goal is to find a list v=[2,3,...] because 3-1=2, 6-3=3, etc.
>>> t
[1, 3, 6]
>>> [j-i for i, j in zip(t[:-1], t[1:])] # or use itertools.izip in py2k
[2, 3]
The other answers are correct but if you're doing numerical work, you might want to consider numpy. Using numpy, the answer is:
v = numpy.diff(t)
If you don't want to use numpy nor zip, you can use the following solution:
>>> t = [1, 3, 6]
>>> v = [t[i+1]-t[i] for i in range(len(t)-1)]
>>> v
[2, 3]
Starting in Python 3.10, with the new pairwise function it's possible to slide through pairs of elements and thus map on rolling pairs:
from itertools import pairwise
[y-x for (x, y) in pairwise([1, 3, 6, 7])]
# [2, 3, 1]
The intermediate result being:
pairwise([1, 3, 6, 7])
# [(1, 3), (3, 6), (6, 7)]
You can use itertools.tee and zip to efficiently build the result:
from itertools import tee
# python2 only:
#from itertools import izip as zip

def differences(seq):
    iterable, copied = tee(seq)
    next(copied)
    for x, y in zip(iterable, copied):
        yield y - x
Or using itertools.islice instead:
from itertools import islice

def differences(seq):
    nexts = islice(seq, 1, None)
    for x, y in zip(seq, nexts):
        yield y - x
You can also avoid using the itertools module:
def differences(seq):
    iterable = iter(seq)
    prev = next(iterable)
    for element in iterable:
        yield element - prev
        prev = element
All these solutions work in constant space if you don't need to store all the results, and they support infinite iterables.
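For instance, because they are lazy, you can pull a few differences out of an infinite iterable; a small sketch using the generator version above:

from itertools import count, islice

squares = (x * x for x in count())            # 0, 1, 4, 9, 16, ... (infinite)
print(list(islice(differences(squares), 5)))  # [1, 3, 5, 7, 9]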
Here are some micro-benchmarks of the solutions:
In [12]: L = range(10**6)
In [13]: from collections import deque
In [15]: %timeit deque(differences_tee(L), maxlen=0)
10 loops, best of 3: 122 ms per loop
In [16]: %timeit deque(differences_islice(L), maxlen=0)
10 loops, best of 3: 127 ms per loop
In [17]: %timeit deque(differences_no_it(L), maxlen=0)
10 loops, best of 3: 89.9 ms per loop
And the other proposed solutions:
In [18]: %timeit [x[1] - x[0] for x in zip(L[1:], L)]
10 loops, best of 3: 163 ms per loop
In [19]: %timeit [L[i+1]-L[i] for i in range(len(L)-1)]
1 loops, best of 3: 395 ms per loop
In [20]: import numpy as np
In [21]: %timeit np.diff(L)
1 loops, best of 3: 479 ms per loop
In [35]: %%timeit
...: res = []
...: for i in range(len(L) - 1):
...: res.append(L[i+1] - L[i])
...:
1 loops, best of 3: 234 ms per loop
Note that:
zip(L[1:], L) is equivalent to zip(L[1:], L[:-1]) since zip already terminates on the shortest input, however it avoids a whole copy of L.
Accessing single elements by index is very slow because every index access is a method call in Python.
numpy.diff is slow because it has to first convert the list to a ndarray. Obviously if you start with an ndarray it will be much faster:
In [22]: arr = np.array(L)
In [23]: %timeit np.diff(arr)
100 loops, best of 3: 3.02 ms per loop
I would suggest using
v = np.diff(t)
this is simple and easy to read.
But if you want v to have the same length as t then
v = np.diff([t[0]] + t) # for python 3.x
or
v = np.diff(t + [t[-1]])
FYI: this will only work for lists.
For numpy arrays:
v = np.diff(np.append(t[0], t))
Using the := walrus operator available in Python 3.8+:
>>> t = [1, 3, 6]
>>> prev = t[0]; [-prev + (prev := x) for x in t[1:]]
[2, 3]
A functional approach:
>>> import operator
>>> a = [1,3,5,7,11,13,17,21]
>>> map(operator.sub, a[1:], a[:-1])
[2, 2, 2, 4, 2, 4, 4]
Using generator:
>>> import operator, itertools
>>> g1,g2 = itertools.tee((x*x for x in xrange(5)),2)
>>> list(itertools.imap(operator.sub, itertools.islice(g1,1,None), g2))
[1, 3, 5, 7]
Using indices:
>>> [a[i+1]-a[i] for i in xrange(len(a)-1)]
[2, 2, 2, 4, 2, 4, 4]
Ok. I think I found the proper solution:
v = [x[0]-x[1] for x in zip(t[1:],t[:-1])]
I suspect this is what the numpy diff command does anyway, but just for completeness you can simply difference the sub-vectors:
from numpy import array as a
a(x[1:])-a(x[:-1])
In addition, I wanted to add solutions to some generalizations of the question:
Solution with periodic boundaries
Sometimes with numerical integration you will want to difference a list with periodic boundary conditions (so that the first element calculates the difference to the last). In this case the numpy.roll function is helpful:
v-np.roll(v,1)
Solutions with zero prepended
Another numpy solution (just for completeness) is to use
numpy.ediff1d(v)
This works like numpy.diff, but only on a vector (it flattens the input array). It offers the ability to prepend or append numbers to the resulting vector. This is useful when handling accumulated fields, as is often the case with fluxes in meteorological variables (e.g. rain, latent heat etc.), where you want the resulting list to be the same length as the input variable, with the first entry untouched.
Then you would write
np.ediff1d(v,to_begin=v[0])
Of course, you can also do this with the np.diff command, in this case though you need to prepend zero to the series with the prepend keyword:
np.diff(v,prepend=0.0)
All the above solutions return a vector that is the same length as the input.
You can also convert the difference into an easily readable transition matrix using
v = t.reshape((c,r)).T - t.T
Where c = number of items in the list and r = 1 since a list is basically a vector or a 1d array.
My way
>>>v = [1,2,3,4,5]
>>>[v[i] - v[i-1] for i, value in enumerate(v[1:], 1)]
[1, 1, 1, 1]

Is there a NumPy function to return the first index of something in an array?

I know there is a method for a Python list to return the first index of something:
>>> xs = [1, 2, 3]
>>> xs.index(2)
1
Is there something like that for NumPy arrays?
Yes, given an array, array, and a value, item to search for, you can use np.where as:
itemindex = numpy.where(array == item)
The result is a tuple with first all the row indices, then all the column indices.
For example, if an array is two dimensions and it contained your item at two locations then
array[itemindex[0][0]][itemindex[1][0]]
would be equal to your item and so would be:
array[itemindex[0][1]][itemindex[1][1]]
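A concrete two-dimensional sketch (the array values here are just for illustration):

import numpy as np

a = np.array([[0, 5],
              [5, 0]])
rows, cols = np.where(a == 5)
# rows -> array([0, 1]), cols -> array([1, 0])
first = (rows[0], cols[0])  # first match in row-major order: (0, 1)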
If you need the index of the first occurrence of only one value, you can use nonzero (or where, which amounts to the same thing in this case):
>>> t = array([1, 1, 1, 2, 2, 3, 8, 3, 8, 8])
>>> nonzero(t == 8)
(array([6, 8, 9]),)
>>> nonzero(t == 8)[0][0]
6
If you need the first index of each of many values, you could obviously do the same as above repeatedly, but there is a trick that may be faster. The following finds the indices of the first element of each subsequence:
>>> nonzero(r_[1, diff(t)[:-1]])
(array([0, 3, 5, 6, 7, 8]),)
Notice that it finds the beginnings of both subsequences of 3s and both subsequences of 8s:
[1, 1, 1, 2, 2, 3, 8, 3, 8, 8]
So it's slightly different than finding the first occurrence of each value. In your program, you may be able to work with a sorted version of t to get what you want:
>>> st = sorted(t)
>>> nonzero(r_[1, diff(st)[:-1]])
(array([0, 3, 5, 7]),)
You can also convert a NumPy array to a list on the fly and get its index. For example,
l = [1,2,3,4,5] # Python list
a = numpy.array(l) # NumPy array
i = a.tolist().index(2) # i will return index of 2
print i
It will print 1.
Just to add a very performant and handy numba alternative based on np.ndenumerate to find the first index:
from numba import njit
import numpy as np

@njit
def index(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
    # If no item was found return None, other return types might be a problem due to
    # numba's type inference.
This is pretty fast and deals naturally with multidimensional arrays:
>>> arr1 = np.ones((100, 100, 100))
>>> arr1[2, 2, 2] = 2
>>> index(arr1, 2)
(2, 2, 2)
>>> arr2 = np.ones(20)
>>> arr2[5] = 2
>>> index(arr2, 2)
(5,)
This can be much faster (because it's short-circuiting the operation) than any approach using np.where or np.nonzero.
However np.argwhere could also deal gracefully with multidimensional arrays (you would need to manually cast it to a tuple and it's not short-circuited) but it would fail if no match is found:
>>> tuple(np.argwhere(arr1 == 2)[0])
(2, 2, 2)
>>> tuple(np.argwhere(arr2 == 2)[0])
(5,)
l.index(x) returns the smallest i such that l[i] == x, i.e. the index of the first occurrence of x in the list.
One can safely assume that the index() function in Python is implemented so that it stops after finding the first match, and this results in an optimal average performance.
For finding an element stopping after the first match in a NumPy array use an iterator (ndenumerate).
In [67]: l=range(100)
In [68]: l.index(2)
Out[68]: 2
NumPy array:
In [69]: a = np.arange(100)
In [70]: next((idx for idx, val in np.ndenumerate(a) if val==2))
Out[70]: (2L,)
Note that both methods, index() and next, raise an exception if the element is not found. With next, one can use a second argument to return a special value in case the element is not found, e.g.
In [77]: next((idx for idx, val in np.ndenumerate(a) if val==400),None)
There are other functions in NumPy (argmax, where, and nonzero) that can be used to find an element in an array, but they all have the drawback of going through the whole array looking for all occurrences, thus not being optimized for finding the first element. Note also that where and nonzero return arrays, so you need to select the first element to get the index.
In [71]: np.argmax(a==2)
Out[71]: 2
In [72]: np.where(a==2)
Out[72]: (array([2], dtype=int64),)
In [73]: np.nonzero(a==2)
Out[73]: (array([2], dtype=int64),)
Time comparison
Just checking that for large arrays the solution using an iterator is faster when the searched item is at the beginning of the array (using %timeit in the IPython shell):
In [285]: a = np.arange(100000)
In [286]: %timeit next((idx for idx, val in np.ndenumerate(a) if val==0))
100000 loops, best of 3: 17.6 µs per loop
In [287]: %timeit np.argmax(a==0)
1000 loops, best of 3: 254 µs per loop
In [288]: %timeit np.where(a==0)[0][0]
1000 loops, best of 3: 314 µs per loop
This is an open NumPy GitHub issue.
See also: Numpy: find first index of value fast
If you're going to use this as an index into something else, you can use boolean indices if the arrays are broadcastable; you don't need explicit indices. The absolute simplest way to do this is to simply index based on a truth value.
other_array[first_array == item]
Any boolean operation works:
a = numpy.arange(100)
other_array[first_array > 50]
The nonzero method takes booleans, too:
index = numpy.nonzero(first_array == item)[0][0]
The two zeros are for the tuple of indices (assuming first_array is 1D) and then the first item in the array of indices.
For one-dimensional sorted arrays, it would be much simpler and more efficient (O(log n)) to use numpy.searchsorted, which returns a NumPy integer (position). For example,
arr = np.array([1, 1, 1, 2, 3, 3, 4])
i = np.searchsorted(arr, 3)
Just make sure the array is already sorted.
Also check whether the returned index i actually contains the searched element, since searchsorted's main objective is to find the indices where elements should be inserted to maintain order.
if arr[i] == 3:
    print("present")
else:
    print("not present")
For 1D arrays, I'd recommend np.flatnonzero(array == value)[0], which is equivalent to both np.nonzero(array == value)[0][0] and np.where(array == value)[0][0] but avoids the ugliness of unboxing a 1-element tuple.
To index on any criteria, you can do something like the following:
In [1]: from numpy import *
In [2]: x = arange(125).reshape((5,5,5))
In [3]: y = indices(x.shape)
In [4]: locs = y[:,x >= 120] # put whatever you want in place of x >= 120
In [5]: pts = hsplit(locs, len(locs[0]))
In [6]: for pt in pts:
.....: print(', '.join(str(p[0]) for p in pt))
4, 4, 0
4, 4, 1
4, 4, 2
4, 4, 3
4, 4, 4
And here's a quick function to do what list.index() does, except doesn't raise an exception if it's not found. Beware -- this is probably very slow on large arrays. You can probably monkey patch this on to arrays if you'd rather use it as a method.
def ndindex(ndarray, item):
    if len(ndarray.shape) == 1:
        try:
            return [ndarray.tolist().index(item)]
        except:
            pass
    else:
        for i, subarray in enumerate(ndarray):
            try:
                return [i] + ndindex(subarray, item)
            except:
                pass
In [1]: ndindex(x, 103)
Out[1]: [4, 0, 3]
An alternative to selecting the first element from np.where() is to use a generator expression together with enumerate, such as:
>>> import numpy as np
>>> x = np.arange(100) # x = array([0, 1, 2, 3, ... 99])
>>> next(i for i, x_i in enumerate(x) if x_i == 2)
2
For a two dimensional array one would do:
>>> x = np.arange(100).reshape(10,10) # x = array([[0, 1, 2,... 9], [10,..19],])
>>> next((i,j) for i, x_i in enumerate(x)
... for j, x_ij in enumerate(x_i) if x_ij == 2)
(0, 2)
The advantage of this approach is that it stops checking the elements of the array after the first match is found, whereas np.where checks all elements for a match. A generator expression would be faster if there's match early in the array.
There are lots of operations in NumPy that could perhaps be put together to accomplish this. This will return a tuple of arrays holding the indices of elements equal to item:
numpy.nonzero(array == item)
You could then take the first elements of the arrays to get a single element.
Comparison of 8 methods
TL;DR:
(Note: applicable to 1d arrays under 100M elements.)
For maximum performance use index_of__v5 (numba + np.ndenumerate + for loop; see the code below).
If numba is not available:
Use index_of__v7 (for loop + enumerate) if the target value is expected to be found within the first 100k elements.
Else use index_of__v2/v3/v4 (numpy.argmax or numpy.flatnonzero based).
Powered by perfplot
import numpy as np
from numba import njit

# Based on: numpy.argmax()
# Proposed by: John Haberstroh (https://stackoverflow.com/a/67497472/7204581)
def index_of__v1(arr: np.array, v):
    is_v = (arr == v)
    return is_v.argmax() if is_v.any() else -1

# Based on: numpy.argmax()
def index_of__v2(arr: np.array, v):
    return (arr == v).argmax() if v in arr else -1

# Based on: numpy.flatnonzero()
# Proposed by: 1'' (https://stackoverflow.com/a/42049655/7204581)
def index_of__v3(arr: np.array, v):
    idxs = np.flatnonzero(arr == v)
    return idxs[0] if len(idxs) > 0 else -1

# Based on: numpy.argmax()
def index_of__v4(arr: np.array, v):
    return np.r_[False, (arr == v)].argmax() - 1

# Based on: numba, for loop
# Proposed by: MSeifert (https://stackoverflow.com/a/41578614/7204581)
@njit
def index_of__v5(arr: np.array, v):
    for idx, val in np.ndenumerate(arr):
        if val == v:
            return idx[0]
    return -1

# Based on: numpy.ndenumerate(), for loop
def index_of__v6(arr: np.array, v):
    return next((idx[0] for idx, val in np.ndenumerate(arr) if val == v), -1)

# Based on: enumerate(), for loop
# Proposed by: Noyer282 (https://stackoverflow.com/a/40426159/7204581)
def index_of__v7(arr: np.array, v):
    return next((idx for idx, val in enumerate(arr) if val == v), -1)

# Based on: list.index()
# Proposed by: Hima (https://stackoverflow.com/a/23994923/7204581)
def index_of__v8(arr: np.array, v):
    l = list(arr)
    try:
        return l.index(v)
    except ValueError:
        return -1
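For reference, a minimal perfplot harness along these lines might look like the following sketch (it assumes perfplot is installed and the index_of__v* functions above are defined; searching for the last element is an arbitrary worst-case choice):

import numpy as np
import perfplot

perfplot.show(
    setup=lambda n: np.arange(n),  # the array to search
    kernels=[
        lambda arr: index_of__v2(arr, len(arr) - 1),
        lambda arr: index_of__v3(arr, len(arr) - 1),
        lambda arr: index_of__v7(arr, len(arr) - 1),
    ],
    labels=['argmax', 'flatnonzero', 'enumerate'],
    n_range=[2 ** k for k in range(1, 20)],
    xlabel='len(arr)',
)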
The numpy_indexed package (disclaimer, I am its author) contains a vectorized equivalent of list.index for numpy.ndarray; that is:
sequence_of_arrays = [[0, 1], [1, 2], [-5, 0]]
arrays_to_query = [[-5, 0], [1, 0]]
import numpy_indexed as npi
idx = npi.indices(sequence_of_arrays, arrays_to_query, missing=-1)
print(idx) # [2, -1]
This solution has vectorized performance, generalizes to ndarrays, and has various ways of dealing with missing values.
There is a fairly idiomatic and vectorized way to do this built into numpy. It uses a quirk of the np.argmax() function to accomplish this -- if many values match, it returns the index of the first match. The trick is that for booleans, there will only ever be two values: True (1) and False (0). Therefore, the returned index will be that of the first True.
For the simple example provided, you can see it work with the following
>>> np.argmax(np.array([1,2,3]) == 2)
1
A great example is computing buckets, e.g. for categorizing. Let's say you have an array of cut points, and you want the "bucket" that corresponds to each element of your array. The algorithm is to compute the first index of cuts where x < cuts (after padding cuts with np.Infinity). I could use broadcasting to broadcast the comparisons, then apply argmax along the cuts-broadcasted axis.
>>> cuts = np.array([10, 50, 100])
>>> cuts_pad = np.array([*cuts, np.Infinity])
>>> x = np.array([7, 11, 80, 443])
>>> bins = np.argmax( x[:, np.newaxis] < cuts_pad[np.newaxis, :], axis = 1)
>>> print(bins)
[0, 1, 2, 3]
As expected, each value from x falls into one of the sequential bins, with well-defined and easy to specify edge case behavior.
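As a side note, NumPy's np.digitize computes the same bucketing directly, which may read more clearly than the broadcast-plus-argmax trick (same cuts and x as above):

import numpy as np

cuts = np.array([10, 50, 100])
x = np.array([7, 11, 80, 443])
np.digitize(x, cuts)  # array([0, 1, 2, 3]) -- the same buckets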
Another option not previously mentioned is the bisect module, which also works on lists, but requires a pre-sorted list/array:
import bisect
import numpy as np
z = np.array([104,113,120,122,126,138])
bisect.bisect_left(z, 122)
yields
3
bisect also returns a result when the number you're looking for doesn't exist in the array, so that the number can be inserted in the correct place.
Note: this is for Python 2.7.
You can use a lambda function to deal with the problem, and it works both on a NumPy array and a list.

your_list = [11, 22, 23, 44, 55]
result = filter(lambda x: your_list[x] > 30, range(len(your_list)))
# result: [3, 4]

import numpy as np
your_numpy_array = np.array([11, 22, 23, 44, 55])
result = filter(lambda x: your_numpy_array[x] > 30, range(len(your_numpy_array)))
# result: [3, 4]
And you can use
result[0]
to get the first index of the filtered elements.
For Python 3, use list(result) instead of result, since filter returns an iterator there.
Use ndindex.
Sample array:

arr = np.array([[1,4],
                [2,3]])
print(arr)
# [[1 4]
#  [2 3]]

Create an empty list to store the (element, index) tuples:

index_elements = []
for i in np.ndindex(arr.shape):
    index_elements.append((arr[i], i))

Convert the list of tuples into a dictionary:

index_elements = dict(index_elements)

The keys are the elements and the values are their indices; use a key to access the index:

index_elements[4]
# (0, 1)
For my use case, I could not sort the array ahead of time because the order of the elements is important. This is my all-NumPy implementation:
import numpy as np
# The array in question
arr = np.array([1,2,1,2,1,5,5,3,5,9])
# Find all of the present values
vals=np.unique(arr)
# Make all indices up-to and including the desired index positive
cum_sum=np.cumsum(arr==vals.reshape(-1,1),axis=1)
# Add zeros to account for the n-1 shape of diff and the all-positive array of the first index
bl_mask=np.concatenate([np.zeros((cum_sum.shape[0],1)),cum_sum],axis=1)>=1
# The desired indices
idx=np.where(np.diff(bl_mask))[1]
# Show results
print(list(zip(vals,idx)))
>>> [(1, 0), (2, 1), (3, 7), (5, 5), (9, 9)]
I believe it accounts for unsorted arrays with duplicate values.
Found another solution with loops:

new_array_of_indicies = []
for i in range(len(some_array)):
    if some_array[i] == some_value:
        new_array_of_indicies.append(i)
