Is there a built-in method that would help me achieve the following efficiently: given an array, I need a list of arrays, each containing the indices of a different unique value of the original array?
If f is the desired function,
b = f(a)
and
u, idxs = unique(a)
then
b[i] == where(idxs==i)[0]
I am aware that pandas.Series.groupby() can do this, but it may not be efficient to create a dict when there are over 10^5 unique integers.
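For reference, the pandas route could look roughly like this (a sketch; GroupBy.indices returns a dict mapping each value to an array of its positions):
import numpy as np
import pandas as pd

a = np.random.randint(5, size=10)
d = pd.Series(a).groupby(a).indices   # dict: unique value -> array of positions in a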
If you have numpy >= 1.9 you can do:
>>> a = np.random.randint(5, size=10)
>>> a
array([0, 2, 4, 4, 2, 4, 4, 3, 2, 1])
>>> unq, unq_inv, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)
>>> np.split(np.argsort(unq_inv), np.cumsum(unq_cnt[:-1]))
[array([0]), array([9]), array([1, 4, 8]), array([7]), array([2, 3, 5, 6])]
>>> unq
array([0, 1, 2, 3, 4])
In earlier versions, you can get the counts by doing an extra:
>>> unq_cnt = np.bincount(unq_inv)
Also, if you want to make sure that the indices for each value are sorted, I think you will need to use a stable sort, e.g. np.argsort(unq_inv, kind='mergesort').
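Wrapped up, the whole thing could look something like this (a sketch of the approach above, using the stable sort):
import numpy as np

def indices_per_unique(a):
    # return (unique values, list of index arrays, one per unique value)
    unq, unq_inv, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)
    order = np.argsort(unq_inv, kind='mergesort')   # stable: indices stay sorted per group
    return unq, np.split(order, np.cumsum(unq_cnt[:-1]))

a = np.random.randint(5, size=10)
unq, groups = indices_per_unique(a)   # groups[k] holds the positions of unq[k] in a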
Thinking about what you seem to be after, which I think is minimizing calls to an expensive function, I don't think you need to do what you are asking. Say that your function was squaring, you could simply do:
>>> unq, unq_inv = np.unique(a, return_inverse=True)
>>> f_unq = unq**2
>>> f_a = f_unq[unq_inv]
>>> a
array([0, 2, 4, 4, 2, 4, 4, 3, 2, 1])
>>> f_a
array([ 0, 4, 16, 16, 4, 16, 16, 9, 4, 1])
def foo(a):
    I = np.arange(a.shape[0])
    d = {}
    while a.shape[0]:
        x = a[0]
        ii = a == x
        d[x] = I[ii]
        a = a[~ii]
        I = I[~ii]
    return d
In [767]: a
Out[767]: array([4, 4, 3, 0, 0, 2, 1, 1, 0, 3])
In [768]: foo(a)
Out[768]:
{0: array([3, 4, 8]),
1: array([6, 7]),
2: array([5]),
3: array([2, 9]),
4: array([0, 1])}
Is this the sort of dictionary that you want?
For small a this works fine.
An equivalent dictionary building function is:
def foo1(a):
    unq = np.unique(a)
    return {i: np.where(a == i)[0] for i in unq}
Offhand I don't see how unq_inv helps with building the dictionary.
foo is about 30% slower than foo1. I was hoping that by reducing the searched array each time a value was processed I might gain some speed, but it looks like the extra bookkeeping eats up the time. And the where time might not be that sensitive to the length of a.
For a2=np.random.randint(5000,size=100000) run times are on the order of 2-3 sec.
But np.random.randint(50000,size=1000000) takes too long to time (for either version).
On further experimentation, a 'dumb' approach using a collections.defaultdict is much faster (20x):
from collections import defaultdict

def food(a):
    d = defaultdict(list)
    for i, j in enumerate(a):
        d[j].append(i)
    return d
The 'too big' (1000000,) array takes only 1.1 sec.
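If you need arrays of indices rather than lists (as in the original spec), the dict can be converted afterwards; a small follow-up sketch, assuming food and a as defined above:
import numpy as np

# convert each list of positions into an ndarray (food and a as defined above)
d = {k: np.array(v) for k, v in food(a).items()}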
Maybe do something like:
s = np.argsort(a, kind='mergesort')   # stable sort keeps indices ordered within each group
d = np.diff(a[s])                     # nonzero where the sorted value changes
starts = np.where(d)[0] + 1           # start of each new group
f = np.split(s, starts)               # one index array per unique value
(code not checked)
I have an array a = np.array([2, 2, 2, 3, 3, 15, 7, 7, 9]) that continues like that. I would like to shift this array but I'm not sure if I can use np.roll() here.
The array I would like to produce is [0, 0, 0, 2, 2, 3, 15, 15, 7].
As you can see, the first run of equal numbers in array a (in this case the three 2's) should be replaced with 0's. Everything should then be shifted so that the 3's are replaced with 2's, the 15 is replaced with a 3, etc. Ideally I would like to do this operation without any for loop as I need it to run quickly.
I realise this operation may be a bit confusing so please ask questions.
If you want to stick with NumPy, you can achieve this using np.unique by returning the counts per unique element with the return_counts option.
Then, simply roll the values and construct a new array with np.repeat:
>>> s, i, c = np.unique(a, return_index=True, return_counts=True)
>>> s, i, c
(array([ 2, 3, 7, 9, 15]), array([0, 3, 6, 8, 5]), array([3, 2, 2, 1, 1]))
The three outputs are, respectively: the unique sorted elements, the indices of the first occurrence of each unique element, and the count per unique element.
np.unique sorts the values, so we first need to restore the original order of both the values and the counts. We can then shift the values with np.roll:
>>> idx = np.argsort(i)
>>> v = np.roll(s[idx], 1)
>>> v[0] = 0
>>> v
array([ 0, 2, 3, 15, 7])
Alternatively, with np.append (this requires a whole copy, though):
>>> v = np.append([0], s[idx][:-1])
>>> v
array([ 0, 2, 3, 15, 7])
Finally reassemble:
>>> np.repeat(v, c[idx])
array([ 0, 0, 0, 2, 2, 3, 15, 15, 7])
Another, more general, solution that will also work when there are recurring values in a. This requires the use of np.diff.
You can get the indices where the value changes (the run boundaries) with:
>>> i = np.diff(np.append(a, [0])).nonzero()[0] + 1
>>> i
array([3, 5, 6, 8, 9])
>>> idx = np.append([0], i)
>>> idx
array([0, 3, 5, 6, 8, 9])
The shifted values are then obtained by indexing the 0-prepended array with idx:
>>> v = np.append([0], a)[idx]
>>> v
array([ 0, 2, 3, 15, 7, 9])
And the counts per run with:
>>> c = np.append(np.diff(i, prepend=0), [0])
>>> c
array([3, 2, 1, 2, 1, 0])
Finally, reassemble:
>>> np.repeat(v, c)
array([ 0, 0, 0, 2, 2, 3, 15, 15, 7])
This is not using numpy, but one approach that comes to mind is to use itertools.groupby to collect contiguous runs of the same elements. Then shift all the elements (by prepending a 0) and use the counts to repeat them.
from itertools import chain, groupby

def shift(data):
    values = [(k, len(list(g))) for k, g in groupby(data)]
    keys = [0] + [i[0] for i in values]
    reps = [i[1] for i in values]
    return list(chain.from_iterable([[k] * rep for k, rep in zip(keys, reps)]))
For example
>>> a = np.array([2,2,2,3,3,15,7,7,9])
>>> shift(a)
[0, 0, 0, 2, 2, 3, 15, 15, 7]
You can try this code:
import numpy as np
a = np.array([2, 2, 2, 3, 3, 15, 7, 7, 9])
diff_a = np.diff(a)
idx = np.flatnonzero(diff_a)
val = diff_a[idx]
val = np.insert(val[:-1], 0, a[0])  # update values
diff_a[idx] = val
res = np.append([0], np.cumsum(diff_a))
print(res)
You can try this:
import numpy as np
a = np.array([2, 2, 2, 3, 3, 15, 7, 7, 9])
z = a - np.pad(a, (1,0))[:-1]
z[m] = np.pad(z[(m := z!=0)], (1,0))[:-1]
print(z.cumsum())
It gives:
[ 0 0 0 2 2 3 15 15 7]
How can I (efficiently) get each combination of a group of 1D-arrays into a 2D array?
Let's say I have arrays A, B, C, and D and I want to create a 2D array with each combination such that I would have 2D arrays that represent AB, AC, AD, ABC, ABD, ..., CD.
For clarity on my notation above:
A = np.array([1,2,3,4,5])
B = np.array([2,3,4,5,6])
C = np.array([3,4,5,6,7])
so
AB = np.array([[1,2,3,4,5], [2,3,4,5,6]])
ABC = np.array([[1,2,3,4,5], [2,3,4,5,6], [3,4,5,6,7]])
So far I have tried something like:
import itertools
import numpy as np

A = np.array([1,2,3,4,5])
B = np.array([2,3,4,5,6])
C = np.array([3,4,5,6,7])
D = np.array([4,5,6,7,8])
stacked = np.vstack((A, B, C, D))

combos = []
it2 = itertools.combinations(range(4), r=2)
for i in list(it2):
    combos.append(i)
it3 = itertools.combinations(range(4), r=3)
for i in list(it3):
    combos.append(i)
it4 = itertools.combinations(range(4), r=4)
for i in list(it4):
    combos.append(i)
which gets me a list of all the possible combos. Then I can apply something like:
for combo in combos:
    stacked[combo, :]
    # then I do something with each combo
And this is where I get stuck.
This is fine when it's only A,B,C,D but if I have A,B,C,... X,Y,Z then the approach above doesn't scale as I'd have to call itertools 20+ times.
How can I overcome this and make it more flexible (in practice the number of arrays will likely be 5-10)?
As others have also recommended, use itertools.combinations
import numpy as np
from itertools import combinations

A = np.array([1,2,3,4,5])
B = np.array([2,3,4,5,6])
C = np.array([3,4,5,6,7])
arrays = [A, B, C]

combos = []
for i in range(2, len(arrays) + 1):
    combos.extend(combinations(arrays, i))

for combo in combos:
    arr = np.vstack(combo)  # do stuff with array
You can use an additional outer for-loop:
import itertools as it
import numpy as np

arrays = np.array([  # let's say your input arrays are stored as one 2d array
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6],
    ...
])

combos = []
for r in range(2, len(arrays) + 1):
    combos.extend(it.combinations(range(len(arrays)), r=r))
When you have N items, there are 2^N possible subsets, so generating every combination of every size takes on the order of 2^N iterations.
You can go through these 2^N iterations with a single loop: iterate n over the range 0 <= n < 2^N and use bitwise operations to select the items whose bits are set in the current n.
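A rough sketch of that bitmask idea (an illustration of the suggestion above, not code from the original answer): every integer n in [0, 2**N) encodes one subset of the arrays through its set bits.
import numpy as np

arrays = [np.array([1, 2, 3, 4, 5]),
          np.array([2, 3, 4, 5, 6]),
          np.array([3, 4, 5, 6, 7])]

N = len(arrays)
for n in range(2 ** N):
    members = [arrays[k] for k in range(N) if n & (1 << k)]
    if len(members) < 2:          # skip the empty set and singletons
        continue
    stacked = np.vstack(members)  # one 2D array per combination
    # do something with stacked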
You could try this:
import numpy as np
from itertools import combinations

A = np.array([1,2,3,4,5])
B = np.array([2,3,4,5,6])
C = np.array([3,4,5,6,7])
lst = [A, B, C]

[list(combinations(lst, i)) for i in range(1, len(lst)+1)]
out:
# [[(array([1, 2, 3, 4, 5]),),
# (array([2, 3, 4, 5, 6]),),
# (array([3, 4, 5, 6, 7]),)],
# [(array([1, 2, 3, 4, 5]), array([2, 3, 4, 5, 6])),
# (array([1, 2, 3, 4, 5]), array([3, 4, 5, 6, 7])),
# (array([2, 3, 4, 5, 6]), array([3, 4, 5, 6, 7]))],
# [(array([1, 2, 3, 4, 5]), array([2, 3, 4, 5, 6]), array([3, 4, 5, 6, 7]))]]
I am a beginner with numpy, and I am trying to extract some data from a long numpy array. What I need to do is start from a defined position in my array, and then subsample every nth data point from that position, until the end of my array.
basically if I had
a = [1,2,3,4,1,2,3,4,1,2,3,4....]
I want to subsample this to start at a[1] and then sample every fourth point from there, to produce something like
b = [2,2,2.....]
You can use numpy's slicing, simply start:stop:step.
>>> xs
array([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4])
>>> xs[1::4]
array([2, 2, 2])
This creates a view of the original data, so it's constant time. It'll also reflect changes to the original array and keep the whole original array in memory:
>>> a
array([1, 2, 3, 4, 5])
>>> b = a[::2] # O(1), constant time
>>> b[:] = 0 # modifying the view changes original array
>>> a # original array is modified
array([0, 2, 0, 4, 0])
So if either of the above is a problem, you can make a copy explicitly:
>>> a
array([1, 2, 3, 4, 5])
>>> b = a[::2].copy() # explicit copy, O(n)
>>> b[:] = 0 # modifying the copy
>>> a # original is intact
array([1, 2, 3, 4, 5])
This isn't constant time, but the result isn't tied to the original array. The copy is also contiguous in memory, which can make some operations on it faster.
Complementary to behzad.nouri's answer:
If you want to control the number of final elements and ensure it's always fixed to a predefined value (rather than controlling a fixed step in between subsamples) you can use numpy's linspace method followed by integer rounding.
For example, with num_elements=4:
>>> a
array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> choice = np.round(np.linspace(1, len(a)-1, num=4)).astype(int)
>>> a[choice]
array([ 2, 5, 7, 10])
Or, in general, subsampling an array with fixed start/end points:
>>> import numpy as np
>>> np.round(np.linspace(0, len(a)-1, num=4)).astype(int)
array([0, 3, 6, 9])
>>> np.round(np.linspace(0, len(a)-1, num=15)).astype(int)
array([0, 1, 1, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 8, 9])
Suppose I have some numpy array (all elements are unique) that I want to sort in descending order. I need to find out which positions elements of initial array will take in sorted array.
Example.
In1: [1, 2, 3] # Input
Out1: [2, 1, 0] # Expected output
In2: [1, -2, 2] # Input
Out2: [1, 2, 0] # Expected output
I tried this one:
def find_positions(A):
    A = np.array(A)
    A_sorted = np.sort(A)[::-1]
    return np.argwhere(A[:, None] == A_sorted[None, :])[:, 1]
But it doesn't work when the input array is very large (len > 100000). What did I do wrong and how can I resolve it?
Approach #1
We could use double argsort -
np.argsort(a)[::-1].argsort() # a is input array/list
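For example, a quick check against the examples from the question:
>>> a = np.array([1, 2, 3])
>>> np.argsort(a)[::-1].argsort()
array([2, 1, 0])
>>> a = np.array([1, -2, 2])
>>> np.argsort(a)[::-1].argsort()
array([1, 2, 0])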
Approach #2
We could use one argsort and then array-assignment -
# https://stackoverflow.com/a/41242285/ # Andras Deak
def argsort_unique(idx):
    n = idx.size
    sidx = np.empty(n, dtype=int)
    sidx[idx] = np.arange(n)
    return sidx

out = argsort_unique(np.argsort(a)[::-1])
Take a look at the numpy.argsort(...) function:
Returns the indices that would sort an array.
Perform an indirect sort along the given axis using the algorithm specified by the kind keyword. It returns an array of indices of the same shape as a that index data along the given axis in sorted order.
Here is the reference from the documentation, and the following is a simple example:
import numpy
arr = numpy.random.rand(100000)
indexes = numpy.argsort(arr)
The indexes array will contain the indices of arr in the order that would sort it.
I face the same problem for plain lists, and would like to avoid using numpy. So I propose a possible solution that should also work for an np.array, and which avoids reversal of the result:
def argsort(A, key=None, reverse=False):
    "Indirect sort of list or array A: return indices of elements in order."
    keyfunc = (lambda i: A[i]) if key is None else (lambda i: key(A[i]))
    return sorted(range(len(A)), key=keyfunc, reverse=reverse)
Example of use:
>>> L = [3,1,4,1,5,9,2,6]
>>> argsort( L )
[1, 3, 6, 0, 2, 4, 7, 5]
>>> [L[i] for i in _]
[1, 1, 2, 3, 4, 5, 6, 9]
>>> argsort( L, key=lambda x:(x%2,x) ) # even elements first
[6, 2, 7, 1, 3, 0, 4, 5]
>>> [L[i] for i in _]
[2, 4, 6, 1, 1, 3, 5, 9]
>>> argsort( L, key=lambda x:(x%2,x), reverse = True)
[5, 4, 0, 1, 3, 7, 2, 6]
>>> [L[i] for i in _]
[9, 5, 3, 1, 1, 6, 4, 2]
Feedback would be welcome! (Efficiency compared to previously proposed solutions? Suggestions for improvements?)