Related
Lets assume I have the follwing arrays:
distance = np.array([2, 3, 5, 4, 8, 2, 3])
idx = np.array([0, 0, 1, 1, 1, 2, 2 ])
Now I want the smallest distance within one index. So my goald would be:
result = [2, 4, 2]
My only idea right now would be something like this:
for i in idx_unique:
result.append(np.amin(distances[np.argwhere(idx = i)]))
But is there a faster way without a loop??
You can convert idx to a boolean vector to use indexing within the distance vector:
distance = np.array([2, 3, 5, 4, 8])
idx = np.array([0, 0, 1, 1, 1]).astype(np.bool)
result = [np.min(distance[~idx]), np.min(distance[idx])]
Although not truly free from loops, here is one way to do that:
import numpy as np
distance = np.array([2, 3, 5, 4, 8, 2, 3])
idx = np.array([0, 0, 1, 1, 1, 2, 2 ])
t = np.split(distance, np.where(idx[:-1] != idx[1:])[0] + 1)
print([np.min(x) for x in t])
Actually, this provides no improvement as both the OP's solution and this one has the same runtime:
res1 = []
def soln1():
for i in idx_unique:
res1.append(np.amin(distances[np.argwhere(idx = i)]))
def soln2():
t = np.split(distance, np.where(idx[:-1] != idx[1:])[0] + 1)
res2 = [np.min(x) for x in t]
Timeit gives:
%timeit soln1
#10000000 loops, best of 5: 24.3 ns per loop
%timeit soln2
#10000000 loops, best of 5: 24.3 ns per loop
This question already has an answer here:
Return array of counts for each feature of input
(1 answer)
Closed 4 years ago.
Imagine you have a 2D-array (as a NumPy int array) like:
[[2,2,3,3],
[2,3,3,3],
[3,3,4,4]]
Now you want to get an array of the same shape, but instead of the original values, you want to replace the number by its occurrences. Which means, the number 2 changes to 3, since it occurred 3 times, the 3s become 7s and the 4s become 2s.
So the output would be:
[[3,3,7,7],
[3,7,7,7],
[7,7,2,2]]
My solution was first to create a dictionary, which saves all original values as keys and as values the number of occurrences. But for arrays of shape 2000x2000, this seemed to be quite slow.
How could I achieve this more efficiently?
Thanks!
I believe you should be able to stay in NumPy here by using return_inverse within np.unique():
If True, also return the indices of the unique array (for the
specified axis, if provided) that can be used to reconstruct ar.
>>> import numpy as np
>>> a = np.array([[2,2,3,3],
... [2,3,3,3],
... [3,3,4,4]])
>>> _, inv, cts = np.unique(a, return_inverse=True, return_counts=True)
>>> cts[inv].reshape(a.shape)
array([[3, 3, 7, 7],
[3, 7, 7, 7],
[7, 7, 2, 2]])
This will also work for the case where the flattened array is not sorted, such as b = np.array([[1, 2, 4], [4, 4, 1]]).
One way is to use numpy.unique to extract value counts.
Then convert to a dictionary and use numpy.vectorize to utilise this dictionary mapping.
import numpy as np
A = np.array([[2,2,3,3],
[2,3,3,3],
[3,3,4,4]])
d = dict(zip(*np.unique(A.ravel(), return_counts=True)))
res = np.vectorize(d.get)(A)
array([[3, 3, 7, 7],
[3, 7, 7, 7],
[7, 7, 2, 2]], dtype=int64)
Performance
I see the above method takes ~2s for a 2000x2000 array versus 3s via a collections.Counter dictionary-based method. But pure numpy solutions by PaulPanzer and BradSolomon are faster still.
import numpy as np
from collections import Counter
A = np.random.randint(0, 10, (2000, 2000))
MAX_LOOKUP = 2**24
def map_count(A):
d = dict(zip(*np.unique(A.ravel(), return_counts=True)))
return np.vectorize(d.get)(A)
def map_count2(A):
d = Counter(A.ravel())
return np.vectorize(d.get)(A)
def bs(A):
_, inv, cts = np.unique(A, return_inverse=True, return_counts=True)
return cts[inv].reshape(A.shape)
def pp(a):
mn, mx = a.min(), a.max()
span = mx-mn+1
if span > MAX_LOOKUP:
raise RuntimeError('values spread to wide')
a = a - mn
return np.bincount(a.ravel(), None, span)[a]
%timeit map_count(A) # 1.9 s ± 24.2 ms per loop
%timeit map_count2(A) # 3 s ± 33.1 ms per loop
%timeit bs(A) # 887 ms ± 20 ms per loop
%timeit pp(A) # 149 ms ± 6.32 ms per loop
Here is an approach that takes advantage of the fact that your values are int:
MAX_LOOKUP = 2**24
def f_pp(a):
mn, mx = a.min(), a.max()
span = mx-mn+1
if span > MAX_LOOKUP:
raise RuntimeError('values spread to wide')
a = a - mn
return np.bincount(a.ravel(), None, span)[a]
Timings (heavily based on #jpp's work):
>>> from timeit import timeit
>>> kwds = dict(globals=globals(), number=3)
>>>
>>> for l, r in [(0, 10), (0, 1000), (-8000000, 8000000)]:
... a = np.random.randint(l, r, (2000, 2000))
... print(l, r)
... print('mc ', timeit('map_count(a)', **kwds))
... print('mc2', timeit('map_count2(a)', **kwds))
... print('bs ', timeit('bs(a)', **kwds))
... print('pp ', timeit('f_pp(a)', **kwds))
...
0 10
mc 2.462232475867495
mc2 3.820418732939288
bs 1.266723491018638
pp 0.11216754489578307
0 1000
mc 2.972961534978822
mc2 4.3769155589398
bs 2.1607728030066937
pp 0.14146877988241613
-8000000 8000000
mc 10.753600731957704
mc2 8.373655589064583
bs 2.700256273150444
pp 0.7070535880047828
If I try
x = np.append(x, (2,3))
the tuple (2,3) does not get appended to the end of the array, rather 2 and 3 get appended individually, even if I originally declared x as
x = np.array([], dtype = tuple)
or
x = np.array([], dtype = (int,2))
What is the proper way to do this?
I agree with #user2357112 comment:
appending to NumPy arrays is catastrophically slower than appending to ordinary lists. It's an operation that they are not at all designed for
Here's a little benchmark:
# measure execution time
import timeit
import numpy as np
def f1(num_iterations):
x = np.dtype((np.int32, (2, 1)))
for i in range(num_iterations):
x = np.append(x, (i, i))
def f2(num_iterations):
x = np.array([(0, 0)])
for i in range(num_iterations):
x = np.vstack((x, (i, i)))
def f3(num_iterations):
x = []
for i in range(num_iterations):
x.append((i, i))
x = np.array(x)
N = 50000
print timeit.timeit('f1(N)', setup='from __main__ import f1, N', number=1)
print timeit.timeit('f2(N)', setup='from __main__ import f2, N', number=1)
print timeit.timeit('f3(N)', setup='from __main__ import f3, N', number=1)
I wouldn't use neither np.append nor vstack, I'd just create my python array properly and then use it to construct the np.array
EDIT
Here's the benchmark output on my laptop:
append: 12.4983000173
vstack: 1.60663705793
list: 0.0252208517006
[Finished in 14.3s]
You need to supply the shape to numpy dtype, like so:
x = np.dtype((np.int32, (1,2)))
x = np.append(x,(2,3))
Outputs
array([dtype(('<i4', (2, 3))), 1, 2], dtype=object)
[Reference][1]http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
If I understand what you mean, you can use vstack:
>>> a = np.array([(1,2),(3,4)])
>>> a = np.vstack((a, (4,5)))
>>> a
array([[1, 2],
[3, 4],
[4, 5]])
I do not have any special insight as to why this works, but:
x = np.array([1, 3, 2, (5,7), 4])
mytuple = [(2, 3)]
mytuplearray = np.empty(len(mytuple), dtype=object)
mytuplearray[:] = mytuple
y = np.append(x, mytuplearray)
print(y) # [1 3 2 (5, 7) 4 (2, 3)]
As others have correctly pointed out, this is a slow operation with numpy arrays. If you're just building some code from scratch, try to use some other data type. But if you know your array will always remain small or you're not going to append much or if you have existing code that you need to tweak quickly, then go ahead.
simplest way:
x=np.append(x,None)
x[-1]=(2,3)
np.append is easy to use with a case like:
In [94]: np.append([1,2,3],4)
Out[94]: array([1, 2, 3, 4])
but its first example is harder to understand. It shows the same sort of flat concatenate that bothers you:
>>> np.append([1, 2, 3], [[4, 5, 6], [7, 8, 9]])
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
Stripped of dimensional tests, np.append does
In [166]: np.append(np.array([1,2],int),(2,3))
Out[166]: array([1, 2, 2, 3])
In [167]: np.concatenate([np.array([1,2],int),np.array((2,3))])
Out[167]: array([1, 2, 2, 3])
So except for the simplest cases you need to understand what np.array((2,3)) does, and how concatenate handles dimensions.
So apart from the speed issues, np.append can be trickier to use that the interface suggests. The parallels to list append are only superficial.
As for append (or concatenate) with dtype=object (not dtype=tuple) or a compound dtype ('i,i'), I couldn't tell you what happens without testing. At a minimum the inputs should already be arrays, and should have a matching dtype. Otherwise the results can unpredicatable.
edit
Don't trust the timings in https://stackoverflow.com/a/38985245/901925. The functions don't produce the same things.
Corrected functions:
In [233]: def g1(num_iterations):
...: x = np.ones((0,2),int)
...: for i in range(num_iterations):
...: x = np.append(x, [(i, i)], axis=0)
...: return x
...:
...: def g2(num_iterations):
...: x = np.ones((0, 2),int)
...: for i in range(num_iterations):
...: x = np.vstack((x, (i, i)))
...: return x
...:
...: def g3(num_iterations):
...: x = []
...: for i in range(num_iterations):
...: x.append((i, i))
...: x = np.array(x)
...: return x
...:
In [234]: g1(3)
Out[234]:
array([[0, 0],
[1, 1],
[2, 2]])
In [235]: g2(3)
Out[235]:
array([[0, 0],
[1, 1],
[2, 2]])
In [236]: g3(3)
Out[236]:
array([[0, 0],
[1, 1],
[2, 2]])
np.append and np.vstack timings are much closer. Both use np.concatenate to do the actual joining. They differ in how the inputs are processed prior to sending them to concatenate.
In [237]: timeit g1(1000)
9.69 ms ± 6.25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [238]: timeit g2(1000)
12.8 ms ± 7.53 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [239]: timeit g3(1000)
537 µs ± 2.22 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
The wrong results. Note that f1 produces a 1d object dtype array, because the starting value is object dtype array, and there's not axis parameter. f2 duplicates the starting array.
In [240]: f1(3)
Out[240]: array([dtype(('<i4', (2, 1))), 0, 0, 1, 1, 2, 2], dtype=object)
In [241]: f2(3)
Out[241]:
array([[0, 0],
[0, 0],
[1, 1],
[2, 2]])
Not only is it slower to use np.append or np.vstack in a loop, it is also hard to do it right.
I want to get the rank of each element, so I use argsort in numpy:
np.argsort(np.array((1,1,1,2,2,3,3,3,3)))
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
it give the same element the different rank, can I get the same rank like:
array([0, 0, 0, 3, 3, 5, 5, 5, 5])
If you don't mind a dependency on scipy, you can use scipy.stats.rankdata, with method='min':
In [14]: a
Out[14]: array([1, 1, 1, 2, 2, 3, 3, 3, 3])
In [15]: from scipy.stats import rankdata
In [16]: rankdata(a, method='min')
Out[16]: array([1, 1, 1, 4, 4, 6, 6, 6, 6])
Note that rankdata starts the ranks at 1. To start at 0, subtract 1 from the result:
In [17]: rankdata(a, method='min') - 1
Out[17]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
If you don't want the scipy dependency, you can use numpy.unique to compute the ranking. Here's a function that computes the same result as rankdata(x, method='min') - 1:
import numpy as np
def rankmin(x):
u, inv, counts = np.unique(x, return_inverse=True, return_counts=True)
csum = np.zeros_like(counts)
csum[1:] = counts[:-1].cumsum()
return csum[inv]
For example,
In [137]: x = np.array([60, 10, 0, 30, 20, 40, 50])
In [138]: rankdata(x, method='min') - 1
Out[138]: array([6, 1, 0, 3, 2, 4, 5])
In [139]: rankmin(x)
Out[139]: array([6, 1, 0, 3, 2, 4, 5])
In [140]: a = np.array([1,1,1,2,2,3,3,3,3])
In [141]: rankdata(a, method='min') - 1
Out[141]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
In [142]: rankmin(a)
Out[142]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
By the way, a single call to argsort() does not give ranks. You can find an assortment of approaches to ranking in the question Rank items in an array using Python/NumPy, including how to do it using argsort().
Alternatively, pandas series has a rank method which does what you need with the min method:
import pandas as pd
pd.Series((1,1,1,2,2,3,3,3,3)).rank(method="min")
# 0 1
# 1 1
# 2 1
# 3 4
# 4 4
# 5 6
# 6 6
# 7 6
# 8 6
# dtype: float64
With focus on performance, here's an approach -
def rank_repeat_based(arr):
idx = np.concatenate(([0],np.flatnonzero(np.diff(arr))+1,[arr.size]))
return np.repeat(idx[:-1],np.diff(idx))
For a generic case with the elements in input array not already sorted, we would need to use argsort() to keep track of the positions. So, we would have a modified version, like so -
def rank_repeat_based_generic(arr):
sidx = np.argsort(arr,kind='mergesort')
idx = np.concatenate(([0],np.flatnonzero(np.diff(arr[sidx]))+1,[arr.size]))
return np.repeat(idx[:-1],np.diff(idx))[sidx.argsort()]
Runtime test
Testing out all the approaches listed thus far to solve the problem on a large dataset.
Sorted array case :
In [96]: arr = np.sort(np.random.randint(1,100,(10000)))
In [97]: %timeit rankdata(arr, method='min') - 1
1000 loops, best of 3: 635 µs per loop
In [98]: %timeit rankmin(arr)
1000 loops, best of 3: 495 µs per loop
In [99]: %timeit (pd.Series(arr).rank(method="min")-1).values
1000 loops, best of 3: 826 µs per loop
In [100]: %timeit rank_repeat_based(arr)
10000 loops, best of 3: 200 µs per loop
Unsorted case :
In [106]: arr = np.random.randint(1,100,(10000))
In [107]: %timeit rankdata(arr, method='min') - 1
1000 loops, best of 3: 963 µs per loop
In [108]: %timeit rankmin(arr)
1000 loops, best of 3: 869 µs per loop
In [109]: %timeit (pd.Series(arr).rank(method="min")-1).values
1000 loops, best of 3: 1.17 ms per loop
In [110]: %timeit rank_repeat_based_generic(arr)
1000 loops, best of 3: 1.76 ms per loop
I've written a function for the same purpose. It uses pure python and numpy only. Please have a look. I put comments as well.
def my_argsort(array):
# this type conversion let us work with python lists and pandas series
array = np.array(array)
# create mapping for unique values
# it's a dictionary where keys are values from the array and
# values are desired indices
unique_values = list(set(array))
mapping = dict(zip(unique_values, np.argsort(unique_values)))
# apply mapping to our array
# np.vectorize works similar map(), and can work with dictionaries
array = np.vectorize(mapping.get)(array)
return array
Hope that helps.
Complex solutions are unnecessary for this problem.
> ary = np.sort([1, 1, 1, 2, 2, 3, 3, 3, 3]) # or anything; must be sorted.
> a = np.diff().cumsum(); a
array([0, 0, 1, 1, 2, 2, 2, 2])
> b = np.r_[0, a]; b # ties get first open rank
array([0, 0, 0, 1, 1, 2, 2, 2, 2])
> c = np.flatnonzero(ary[1:] != ary[:-1])
> np.r_[0, 1 + c][b] # ties get last open rank
array([0, 0, 0, 3, 3, 5, 5, 5])
I am surprised this specific question hasn't been asked before, but I really didn't find it on SO nor on the documentation of np.sort.
Say I have a random numpy array holding integers, e.g:
> temp = np.random.randint(1,10, 10)
> temp
array([2, 4, 7, 4, 2, 2, 7, 6, 4, 4])
If I sort it, I get ascending order by default:
> np.sort(temp)
array([2, 2, 2, 4, 4, 4, 4, 6, 7, 7])
but I want the solution to be sorted in descending order.
Now, I know I can always do:
reverse_order = np.sort(temp)[::-1]
but is this last statement efficient? Doesn't it create a copy in ascending order, and then reverses this copy to get the result in reversed order? If this is indeed the case, is there an efficient alternative? It doesn't look like np.sort accepts parameters to change the sign of the comparisons in the sort operation to get things in reverse order.
temp[::-1].sort() sorts the array in place, whereas np.sort(temp)[::-1] creates a new array.
In [25]: temp = np.random.randint(1,10, 10)
In [26]: temp
Out[26]: array([5, 2, 7, 4, 4, 2, 8, 6, 4, 4])
In [27]: id(temp)
Out[27]: 139962713524944
In [28]: temp[::-1].sort()
In [29]: temp
Out[29]: array([8, 7, 6, 5, 4, 4, 4, 4, 2, 2])
In [30]: id(temp)
Out[30]: 139962713524944
>>> a=np.array([5, 2, 7, 4, 4, 2, 8, 6, 4, 4])
>>> np.sort(a)
array([2, 2, 4, 4, 4, 4, 5, 6, 7, 8])
>>> -np.sort(-a)
array([8, 7, 6, 5, 4, 4, 4, 4, 2, 2])
For short arrays I suggest using np.argsort() by finding the indices of the sorted negatived array, which is slightly faster than reversing the sorted array:
In [37]: temp = np.random.randint(1,10, 10)
In [38]: %timeit np.sort(temp)[::-1]
100000 loops, best of 3: 4.65 µs per loop
In [39]: %timeit temp[np.argsort(-temp)]
100000 loops, best of 3: 3.91 µs per loop
Be careful with dimensions.
Let
x # initial numpy array
I = np.argsort(x) or I = x.argsort()
y = np.sort(x) or y = x.sort()
z # reverse sorted array
Full Reverse
z = x[I[::-1]]
z = -np.sort(-x)
z = np.flip(y)
flip changed in 1.15, previous versions 1.14 required axis. Solution: pip install --upgrade numpy.
First Dimension Reversed
z = y[::-1]
z = np.flipud(y)
z = np.flip(y, axis=0)
Second Dimension Reversed
z = y[::-1, :]
z = np.fliplr(y)
z = np.flip(y, axis=1)
Testing
Testing on a 100×10×10 array 1000 times.
Method | Time (ms)
-------------+----------
y[::-1] | 0.126659 # only in first dimension
-np.sort(-x) | 0.133152
np.flip(y) | 0.121711
x[I[::-1]] | 4.611778
x.sort() | 0.024961
x.argsort() | 0.041830
np.flip(x) | 0.002026
This is mainly due to reindexing rather than argsort.
# Timing code
import time
import numpy as np
def timeit(fun, xs):
t = time.time()
for i in range(len(xs)): # inline and map gave much worse results for x[-I], 5*t
fun(xs[i])
t = time.time() - t
print(np.round(t,6))
I, N = 1000, (100, 10, 10)
xs = np.random.rand(I,*N)
timeit(lambda x: np.sort(x)[::-1], xs)
timeit(lambda x: -np.sort(-x), xs)
timeit(lambda x: np.flip(x.sort()), xs)
timeit(lambda x: x[x.argsort()[::-1]], xs)
timeit(lambda x: x.sort(), xs)
timeit(lambda x: x.argsort(), xs)
timeit(lambda x: np.flip(x), xs)
np.flip() and reversed indexed are basically the same. Below is a benchmark using three different methods. It seems np.flip() is slightly faster. Using negation is slower because it is used twice so reversing the array is faster than that.
** Note that np.flip() is faster than np.fliplr() according to my tests.
def sort_reverse(x):
return np.sort(x)[::-1]
def sort_negative(x):
return -np.sort(-x)
def sort_flip(x):
return np.flip(np.sort(x))
arr=np.random.randint(1,10000,size=(1,100000))
%timeit sort_reverse(arr)
%timeit sort_negative(arr)
%timeit sort_flip(arr)
and the results are:
6.61 ms ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.69 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.57 ms ± 58.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Hello I was searching for a solution to reverse sorting a two dimensional numpy array, and I couldn't find anything that worked, but I think I have stumbled on a solution which I am uploading just in case anyone is in the same boat.
x=np.sort(array)
y=np.fliplr(x)
np.sort sorts ascending which is not what you want, but the command fliplr flips the rows left to right! Seems to work!
Hope it helps you out!
I guess it's similar to the suggest about -np.sort(-a) above but I was put off going for that by comment that it doesn't always work. Perhaps my solution won't always work either however I have tested it with a few arrays and seems to be OK.
Unfortunately when you have a complex array, only np.sort(temp)[::-1] works properly. The two other methods mentioned here are not effective.
You could sort the array first (Ascending by default) and then apply np.flip()
(https://docs.scipy.org/doc/numpy/reference/generated/numpy.flip.html)
FYI It works with datetime objects as well.
Example:
x = np.array([2,3,1,0])
x_sort_asc=np.sort(x)
print(x_sort_asc)
>>> array([0, 1, 2, 3])
x_sort_desc=np.flip(x_sort_asc)
print(x_sort_desc)
>>> array([3,2,1,0])
Here is a quick trick
In[3]: import numpy as np
In[4]: temp = np.random.randint(1,10, 10)
In[5]: temp
Out[5]: array([5, 4, 2, 9, 2, 3, 4, 7, 5, 8])
In[6]: sorted = np.sort(temp)
In[7]: rsorted = list(reversed(sorted))
In[8]: sorted
Out[8]: array([2, 2, 3, 4, 4, 5, 5, 7, 8, 9])
In[9]: rsorted
Out[9]: [9, 8, 7, 5, 5, 4, 4, 3, 2, 2]
i suggest using this ...
np.arange(start_index, end_index, intervals)[::-1]
for example:
np.arange(10, 20, 0.5)
np.arange(10, 20, 0.5)[::-1]
Then your resault:
[ 19.5, 19. , 18.5, 18. , 17.5, 17. , 16.5, 16. , 15.5,
15. , 14.5, 14. , 13.5, 13. , 12.5, 12. , 11.5, 11. ,
10.5, 10. ]