Numpy: Sample group of indices with different values - python

Given some numpy array a
array([2,2,3,3,2,0,0,0,2,2,3,2,0,1,1,0])
what is the best way to get all groups of n indices with each of them having a different value in a?
Obviously there is no group larger than the number of unique elements in a, here 4.
So for example, one group of size 4 is
array([0,2,5,13])
Consider that a might be quite long, let's say up to 250k.
If the result gets too large, it might also be desirable not to compute all such groups, but only the first k requested.

For inputs as integers, we can have a solution based on this post -
In [41]: sidx = a.argsort() # use kind='mergesort' for first occurrences
In [42]: c = np.bincount(a)
In [43]: np.sort(sidx[np.r_[0,(c[c!=0])[:-1].cumsum()]])
Out[43]: array([ 0, 2, 5, 13])
Another approach, closely related to the previous one, that works for generic inputs -
In [44]: b = a[sidx]
In [45]: np.sort(sidx[np.r_[True,b[:-1]!=b[1:]]])
Out[45]: array([ 0, 2, 5, 13])
Another with numba, for memory efficiency and hence performance too, which selects the first index of each unique value and also takes the additional k argument -
import numpy as np
from numba import njit

@njit
def _numba1(a, notfound, out, k):
    iterID = 0
    for i, e in enumerate(a):
        if notfound[e]:
            # first time this value is seen: record its index
            notfound[e] = False
            out[iterID] = i
            iterID += 1
            if iterID >= k:
                break
    return out

def unique_elems(a, k, maxnum=None):
    # feed in max of the input array as maxnum value if known
    if maxnum is None:
        L = a.max() + 1
    else:
        L = maxnum + 1
    notfound = np.ones(L, dtype=bool)
    # note: if a has fewer than k unique values, the unused tail of out keeps its initial value 1
    out = np.ones(k, dtype=a.dtype)
    return _numba1(a, notfound, out, k)
Sample run -
In [16]: np.random.seed(0)
...: a = np.random.randint(0,10,200)
In [17]: a
Out[17]:
array([5, 0, 3, 3, 7, 9, 3, 5, 2, 4, 7, 6, 8, 8, 1, 6, 7, 7, 8, 1, 5, 9,
8, 9, 4, 3, 0, 3, 5, 0, 2, 3, 8, 1, 3, 3, 3, 7, 0, 1, 9, 9, 0, 4,
7, 3, 2, 7, 2, 0, 0, 4, 5, 5, 6, 8, 4, 1, 4, 9, 8, 1, 1, 7, 9, 9,
3, 6, 7, 2, 0, 3, 5, 9, 4, 4, 6, 4, 4, 3, 4, 4, 8, 4, 3, 7, 5, 5,
0, 1, 5, 9, 3, 0, 5, 0, 1, 2, 4, 2, 0, 3, 2, 0, 7, 5, 9, 0, 2, 7,
2, 9, 2, 3, 3, 2, 3, 4, 1, 2, 9, 1, 4, 6, 8, 2, 3, 0, 0, 6, 0, 6,
3, 3, 8, 8, 8, 2, 3, 2, 0, 8, 8, 3, 8, 2, 8, 4, 3, 0, 4, 3, 6, 9,
8, 0, 8, 5, 9, 0, 9, 6, 5, 3, 1, 8, 0, 4, 9, 6, 5, 7, 8, 8, 9, 2,
8, 6, 6, 9, 1, 6, 8, 8, 3, 2, 3, 6, 3, 6, 5, 7, 0, 8, 4, 6, 5, 8,
2, 3])
In [19]: unique_elems(a, k=6)
Out[19]: array([0, 1, 2, 4, 5, 8])
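For small inputs, or where numba is not available, the same first-k selection can be cross-checked with plain NumPy. This is only a sketch: the helper name first_k_unique_indices is made up for illustration, and it sorts the whole array, so it does not stop early the way the loop above does.

```python
import numpy as np


def first_k_unique_indices(a, k):
    """Return up to k indices, each the first occurrence of a distinct value in a."""
    # return_index gives the first-occurrence index of every unique value
    _, first_idx = np.unique(a, return_index=True)
    # order those indices by position in a and keep the earliest k
    return np.sort(first_idx)[:k]


a = np.array([2, 2, 3, 3, 2, 0, 0, 0, 2, 2, 3, 2, 0, 1, 1, 0])
print(first_k_unique_indices(a, 4))  # [ 0  2  5 13]
print(first_k_unique_indices(a, 2))  # [0 2]
```

Unlike unique_elems, this version returns fewer than k indices when a has fewer than k unique values.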

Use numpy.unique for this job. It has several other options; for instance, return_counts=True returns the number of times each unique item appears in a.
import numpy as np
# Sample data
a = np.array([2,2,3,3,2,0,0,0,2,2,3,2,0,1,1,0])
# The unique values are in 'u'
# The indices of the first occurrence of the unique values are in 'indices'
u, indices = np.unique(a, return_index=True)
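The answers above each return a single group (the first occurrence of every unique value). If all groups of n indices are wanted, or only the first k of them, one possible approach is a lazy generator that buckets the indices by value and then combines the buckets. The following is a sketch under that assumption; iter_groups is a hypothetical helper name, and the indices inside each tuple are ordered by value, not by position.

```python
from itertools import combinations, islice, product

import numpy as np


def iter_groups(a, n):
    """Lazily yield tuples of n indices whose values in a are pairwise different."""
    a = np.asarray(a)
    # A stable sort keeps the indices of equal values in ascending order.
    order = np.argsort(a, kind="stable")
    sorted_vals = a[order]
    # Positions in the sorted array where a new value starts.
    starts = np.flatnonzero(np.r_[True, sorted_vals[1:] != sorted_vals[:-1]])
    buckets = np.split(order, starts[1:])  # one index array per unique value
    for chosen in combinations(buckets, n):  # pick n distinct values
        for idx in product(*chosen):         # pick one index for each of them
            yield tuple(int(i) for i in idx)


a = np.array([2, 2, 3, 3, 2, 0, 0, 0, 2, 2, 3, 2, 0, 1, 1, 0])
first_three = list(islice(iter_groups(a, 4), 3))
print(first_three)  # [(5, 13, 0, 2), (5, 13, 0, 3), (5, 13, 0, 10)]
```

Because the generator is lazy, islice stops after k groups without building the rest. The total number of groups grows as the product of the bucket sizes (here 5 * 2 * 6 * 3 = 180), so materialising all of them for a 250k-element array is unlikely to be practical.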

Related

Numpy: convert last axis to list [duplicate]

Let a numpy array have shape (x, y, z).
I want it to be (x, y) shape with every element being a list of z-length: [a, b, c, ..., z]
Is there any way to do it with numpy methods?
You can use tolist and assign to a preallocated object array:
import numpy as np
a = np.random.randint(0,10,(100,100,100))
def f():
    A = np.empty(a.shape[:-1],object)
    A[...] = a.tolist()
    return A
f()[99,99]
# [4, 5, 9, 2, 8, 9, 9, 6, 8, 5, 7, 9, 8, 7, 6, 1, 9, 6, 2, 9, 0, 7, 0, 1, 2, 8, 4, 4, 7, 0, 1, 2, 3, 8, 9, 6, 0, 1, 4, 7, 0, 7, 9, 3, 9, 1, 8, 7, 1, 2, 3, 6, 6, 2, 7, 0, 2, 8, 7, 0, 0, 1, 8, 2, 6, 3, 5, 4, 9, 6, 9, 0, 2, 5, 9, 5, 3, 7, 0, 1, 9, 0, 8, 2, 0, 7, 3, 6, 9, 9, 4, 4, 3, 8, 4, 7, 4, 2, 1, 8]
type(f()[99,99])
# <class 'list'>
from timeit import timeit
timeit(f,number=100)*10
# 28.67872992530465
I can't imagine why numpy would need such a method. Here is, more or less, a pythonic solution.
import numpy as np
# an example array with shape [2,3,4]
a = np.random.random([2,3,4])
# create the target array shaped [2,3] with 'object' type (accepting other types than numbers).
b = np.array([[None for row in mat] for mat in a])
for i in range(b.shape[0]):
    for j in range(b.shape[1]):
        b[i,j] = list(a[i,j])

How can I share memory between numpy arrays?

I have a large numpy array of size 100x100. Among these 10000 values, there are only about 50 unique values. So I want to create a second array of length 50, containing these unique values, and then somehow map the large array to the smaller array. Effectively, I want to store just 50 values in my system instead of redundant 10000 values.
Slices of arrays seem to share memory, but as soon as I use specific indexing, memory sharing is lost.
a = np.array([1,2,3,4,5])
b = a[:3]
indices = [0,1,2]
c = a[indices]
print(b,c)
print(np.shares_memory(a,b),np.shares_memory(a,c))
This gives the output:
[1 2 3] [1 2 3]
True False
Even though b and c are referring to the same values of a, b(the slice) shares memory with a while c doesn't. If I execute b[0] = 100, a[0] also becomes 100 since they share memory. That is not the case with c.
I want to make c, which is a collection of values which are all from a, share memory with a.
In general it is not possible to save memory in this way. The reason is that your data consists of 64-bit integers, and pointers are also 64-bit integers, so if you try to store each value exactly once in some auxiliary array and then point at those values, you will end up using basically the same amount of space.
The answer would be different if, for example, some of your arrays were subsets of other ones, or if you were storing large types like long strings.
So make a random array with a small set of unique values:
In [45]: x = np.random.randint(0,10,(10,10))
In [46]: x
Out[46]:
array([[4, 3, 8, 5, 4, 8, 8, 1, 8, 1],
[9, 2, 7, 2, 9, 5, 3, 9, 3, 3],
[6, 2, 6, 9, 4, 2, 3, 4, 6, 7],
[1, 0, 2, 1, 0, 9, 4, 2, 6, 2],
[8, 1, 6, 8, 3, 9, 5, 0, 8, 5],
[4, 9, 1, 4, 1, 2, 8, 4, 7, 2],
[4, 5, 2, 4, 8, 0, 1, 4, 4, 7],
[2, 2, 0, 5, 3, 0, 3, 3, 3, 9],
[3, 1, 0, 6, 4, 8, 8, 3, 5, 2],
[7, 5, 9, 2, 8, 0, 8, 1, 7, 8]])
Find the unique ones:
In [48]: np.unique(x)
Out[48]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Better yet, get the unique values plus an array that lets us map those values back onto the original:
In [49]: np.unique(x, return_inverse=True)
Out[49]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([4, 3, 8, 5, 4, 8, 8, 1, 8, 1, 9, 2, 7, 2, 9, 5, 3, 9, 3, 3, 6, 2,
6, 9, 4, 2, 3, 4, 6, 7, 1, 0, 2, 1, 0, 9, 4, 2, 6, 2, 8, 1, 6, 8,
3, 9, 5, 0, 8, 5, 4, 9, 1, 4, 1, 2, 8, 4, 7, 2, 4, 5, 2, 4, 8, 0,
1, 4, 4, 7, 2, 2, 0, 5, 3, 0, 3, 3, 3, 9, 3, 1, 0, 6, 4, 8, 8, 3,
5, 2, 7, 5, 9, 2, 8, 0, 8, 1, 7, 8]))
There's a value in the reverse mapping for each element in the original.
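As a rough illustration of how that inverse mapping can be used, the sketch below rebuilds the original array from the unique values and, assuming there are at most 256 unique values, stores the mapping as uint8 so that it takes less space than the int64 original. The array sizes follow the question (100x100, about 50 unique values); the variable names are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 50, size=(100, 100))  # int64 by default, 10000 values

u, inv = np.unique(x, return_inverse=True)
assert len(u) <= 256  # required for the uint8 index below

# reshape is a no-op on NumPy versions that already return inv in x's shape
inv = inv.reshape(x.shape).astype(np.uint8)

restored = u[inv]  # fancy indexing rebuilds a full copy of x
print(np.array_equal(restored, x))      # True
print(x.nbytes, u.nbytes + inv.nbytes)  # original size vs. compact size
```

The saving comes from the one-byte index, not from sharing memory: u[inv] is still a copy, so writing to restored does not change u.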

How to generate a list of numbers with duplicates based on a certain seed in python

I don't know how to generate a list of numbers with duplicates based on a certain seed.
I have tried using the code below, but it cannot generate numbers that have duplicates
random.seed(3340)
test = random.sample(range(100), 100000)
I think this could work, but I got an error saying "ValueError: Sample larger than population or is negative"
I could implement some functions that can do this, but I think it would be a great idea if I can use some libraries.
random.sample samples without replacement. random.choices samples with replacement, which is what you want:
In [1]: import random
In [2]: random.choices([1, 2], k=10)
Out[2]: [2, 1, 1, 2, 1, 1, 1, 2, 2, 1]
You can also do this with numpy:
In [3]: import numpy
In [4]: numpy.random.randint(0, 10, 100)
Out[4]:
array([7, 6, 3, 3, 8, 5, 9, 5, 4, 5, 1, 5, 8, 2, 4, 3, 9, 3, 5, 7, 9, 6,
2, 3, 5, 8, 4, 9, 3, 3, 0, 8, 4, 4, 7, 2, 8, 4, 4, 9, 1, 1, 7, 1,
3, 1, 1, 5, 1, 7, 5, 1, 9, 6, 0, 4, 8, 9, 9, 4, 7, 6, 0, 5, 1, 8,
4, 8, 9, 8, 5, 4, 3, 0, 2, 6, 4, 4, 2, 3, 0, 6, 7, 3, 5, 9, 3, 7,
4, 1, 7, 6, 7, 8, 7, 6, 0, 5, 1, 0])
I don't know if you're looking for a simpler solution, but you could index into the population inside a list comprehension:
population = list(range(100))
sample = [population[random.randint(0, 99)] for _ in range(100000)]
You could use this comprehension as well:
random.seed(3340)
test = [random.randrange(100) for _ in range(100000)]
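To tie the answers back to the seed from the question, here is a small sketch showing that re-seeding reproduces the same list of draws with replacement. The seed 3340 and the sizes come from the question; the variable names are arbitrary.

```python
import random

import numpy as np

random.seed(3340)
first = random.choices(range(100), k=100000)

random.seed(3340)  # re-seeding replays the same sequence
second = random.choices(range(100), k=100000)

print(first == second)                # True
print(len(set(first)) < len(first))   # True: 100000 draws from 100 values must repeat

# NumPy has its own generator, seeded independently of the random module
rng = np.random.default_rng(3340)
arr = rng.integers(0, 100, size=100000)
```

The random module and NumPy use separate generators, so the same seed gives reproducible results within each library but not the same numbers across the two.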

Append arrays of different dimensions to get a single array

I have three vectors (numpy arrays), vector1, vector2 and vector3,
with the following shapes:
Dimension(vector1)=(200,2048)
Dimension(vector2)=(200,8192)
Dimension(vector3)=(200,32768)
I would like to append these vectors to get vector4:
Dimension(vector4)= (200,2048+8192+32768)= (200, 43008)
That is, vector1 first, then vector2, then vector3.
I tried the following:
vector4=numpy.concatenate((vector1,vector2,vector3),axis=0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly
and
vector4=numpy.append(vector4,[vector1,vector2,vectors3],axis=0)
TypeError: append() missing 1 required positional argument: 'values'
I believe you are looking for numpy.hstack.
>>> import numpy as np
>>> a = np.arange(4).reshape(2,2)
>>> b = np.arange(6).reshape(2,3)
>>> c = np.arange(8).reshape(2,4)
>>> a
array([[0, 1],
[2, 3]])
>>> b
array([[0, 1, 2],
[3, 4, 5]])
>>> c
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
>>> np.hstack((a,b,c))
array([[0, 1, 0, 1, 2, 0, 1, 2, 3],
[2, 3, 3, 4, 5, 4, 5, 6, 7]])
The error message is pretty much telling you exactly what the problem is:
ValueError: all the input array dimensions except for the concatenation axis must match exactly
But you are doing the opposite: the sizes along the concatenation axis match exactly, but the others don't. Consider:
In [3]: arr1 = np.random.randint(0,10,(20, 5))
In [4]: arr2 = np.random.randint(0,10,(20, 3))
In [5]: arr3 = np.random.randint(0,10,(20, 11))
Note the dimensions: only the second axis differs, so concatenate along the second axis (axis=1) rather than the first:
In [8]: arr1.shape, arr2.shape, arr3.shape
Out[8]: ((20, 5), (20, 3), (20, 11))
In [9]: np.concatenate((arr1, arr2, arr3), axis=1)
Out[9]:
array([[3, 1, 4, 7, 3, 6, 1, 1, 6, 7, 4, 6, 8, 6, 2, 8, 2, 5, 0],
[4, 2, 2, 1, 7, 8, 0, 7, 2, 2, 3, 9, 8, 0, 7, 3, 5, 9, 6],
[2, 8, 9, 8, 5, 3, 5, 8, 5, 2, 4, 1, 2, 0, 3, 2, 9, 1, 0],
[6, 7, 3, 5, 6, 8, 3, 8, 4, 8, 1, 5, 4, 4, 6, 4, 0, 3, 4],
[3, 5, 8, 8, 7, 7, 4, 8, 7, 3, 8, 7, 0, 2, 8, 9, 1, 9, 0],
[5, 4, 8, 3, 7, 8, 3, 2, 7, 8, 2, 4, 8, 0, 6, 9, 2, 0, 3],
[0, 0, 1, 8, 6, 4, 4, 4, 2, 8, 4, 1, 4, 1, 3, 1, 5, 5, 1],
[1, 6, 3, 3, 9, 2, 3, 4, 9, 2, 6, 1, 4, 1, 5, 6, 0, 1, 9],
[4, 5, 4, 7, 1, 4, 0, 8, 8, 1, 6, 0, 4, 6, 3, 1, 2, 5, 2],
[6, 4, 3, 2, 9, 4, 1, 7, 7, 0, 0, 5, 9, 3, 7, 4, 5, 6, 1],
[7, 7, 0, 4, 1, 9, 9, 1, 0, 1, 8, 3, 6, 0, 5, 1, 4, 0, 7],
[7, 9, 0, 4, 0, 5, 5, 9, 8, 9, 9, 7, 8, 8, 2, 6, 2, 3, 1],
[4, 1, 6, 5, 4, 5, 6, 7, 9, 2, 5, 8, 6, 6, 6, 8, 2, 3, 1],
[7, 7, 8, 5, 0, 8, 5, 6, 4, 4, 3, 5, 9, 8, 7, 9, 8, 8, 1],
[3, 9, 3, 6, 3, 2, 2, 4, 0, 1, 0, 4, 3, 0, 1, 3, 4, 1, 3],
[5, 1, 9, 7, 1, 8, 3, 9, 4, 7, 6, 7, 4, 7, 0, 1, 2, 8, 7],
[6, 3, 8, 0, 6, 2, 1, 8, 1, 0, 0, 3, 7, 2, 1, 5, 7, 0, 7],
[5, 4, 7, 5, 5, 8, 3, 2, 6, 1, 0, 4, 6, 9, 7, 3, 9, 2, 5],
[1, 4, 8, 5, 7, 2, 0, 2, 6, 2, 6, 5, 5, 4, 6, 1, 8, 8, 1],
[4, 4, 5, 6, 2, 6, 0, 5, 1, 8, 4, 5, 8, 9, 2, 1, 0, 4, 2]])
In [10]: np.concatenate((arr1, arr2, arr3), axis=1).shape
Out[10]: (20, 19)
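Applied to the shapes from the question, the same call would look roughly like this. The arrays here are stand-ins filled with constant values, since the real data is not shown.

```python
import numpy as np

# Stand-in data with the shapes from the question; the fill values 1, 2, 3
# are arbitrary and only make the column blocks easy to tell apart.
vector1 = np.full((200, 2048), 1, dtype=np.float32)
vector2 = np.full((200, 8192), 2, dtype=np.float32)
vector3 = np.full((200, 32768), 3, dtype=np.float32)

# Join along the second axis: rows stay at 200, columns add up.
vector4 = np.concatenate((vector1, vector2, vector3), axis=1)
print(vector4.shape)  # (200, 43008)
```

The columns keep the order of the tuple, so the first 2048 columns of vector4 come from vector1, the next 8192 from vector2, and the rest from vector3.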

How to find zero elements in a sparse matrix

I know that scipy.sparse.find(A) returns 3 arrays I,J,V each of them containing the rows, columns, and values of the nonzero elements respectively.
What I want is a way to do the same (except the V array) for all zero elements without having to iterate through the matrix, since it's too large.
Make a small sparse matrix with 10% density:
In [1]: from scipy import sparse
In [2]: M = sparse.random(10,10,.1)
In [3]: M
Out[3]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in COOrdinate format>
The 10 nonzero values:
In [5]: sparse.find(M)
Out[5]:
(array([6, 4, 1, 2, 3, 0, 1, 6, 9, 6], dtype=int32),
array([1, 2, 3, 3, 3, 4, 4, 4, 5, 8], dtype=int32),
array([ 0.91828586, 0.29763717, 0.12771201, 0.24986069, 0.14674883,
0.56018409, 0.28643427, 0.11654358, 0.8784731 , 0.13253971]))
If, out of the 100 elements of the matrix, 10 are nonzero, then 90 elements are zero. Do you really want the indices of all of those?
where or nonzero on the dense equivalent gives the same indices:
In [6]: A = M.toarray() # dense
In [7]: np.where(A)
Out[7]:
(array([0, 1, 1, 2, 3, 4, 6, 6, 6, 9], dtype=int32),
array([4, 3, 4, 3, 3, 2, 1, 4, 8, 5], dtype=int32))
And the indices of the 90 zero values:
In [8]: np.where(A==0)
Out[8]:
(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9], dtype=int32),
array([0, 1, 2, 3, 5, 6, 7, 8, 9, 0, 1, 2, 5, 6, 7, 8, 9, 0, 1, 2, 4, 5, 6,
7, 8, 9, 0, 1, 2, 4, 5, 6, 7, 8, 9, 0, 1, 3, 4, 5, 6, 7, 8, 9, 0, 1,
2, 3, 4, 5, 6, 7, 8, 9, 0, 2, 3, 5, 6, 7, 9, 0, 1, 2, 3, 4, 5, 6, 7,
8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 6, 7, 8, 9], dtype=int32))
That's 2 arrays of shape (90,), 180 integers, as opposed to the 100 values in the dense array itself. If your sparse matrix is too large to convert to dense, it will be too large to produce all the zero indices (assuming reasonable sparsity).
The print(M) shows the same triplets as the find. The attributes of the coo format also give the nonzero indices:
In [13]: M.row
Out[13]: array([6, 6, 3, 4, 1, 6, 9, 2, 1, 0], dtype=int32)
In [14]: M.col
Out[14]: array([1, 4, 3, 2, 3, 8, 5, 3, 4, 4], dtype=int32)
(Sometimes manipulation of a matrix can set values to 0 without removing them from the attributes. So find/nonzero takes an added step to remove those, if any.)
We could apply find to M==0 as well - but sparse will give us a warning.
In [15]: sparse.find(M==0)
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:213: SparseEfficiencyWarning: Comparing a sparse matrix with 0 using == is inefficient, try using != instead.
", try using != instead.", SparseEfficiencyWarning)
It's the same thing that I've been warning about - the large size of this set. The resulting arrays are the same as in Out[8].
Assuming you have a scipy sparse array and have imported find:
from itertools import product
I, J, _ = find(your_sparse_array)
# build a set: a bare zip() is an iterator in Python 3 and would be consumed by the first membership test
nonzero = set(zip(I, J))
nrows, ncols = your_sparse_array.shape
for a, b in product(range(nrows), range(ncols)):
    if (a, b) not in nonzero:
        print(a, b)
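If the full list of zero positions is too large to hold at once, one option is to produce it one row at a time from the CSR representation, so that only a single row of column indices is in memory at any moment. This is a sketch, not a SciPy feature: iter_zero_positions is a hypothetical helper, and it still does work proportional to the number of zeros.

```python
import numpy as np
from scipy import sparse


def iter_zero_positions(matrix):
    """Yield (row, zero_columns) one row at a time for a SciPy sparse matrix."""
    csr = matrix.tocsr(copy=True)  # work on a copy so the input is untouched
    csr.eliminate_zeros()          # drop explicitly stored zeros
    all_cols = np.arange(csr.shape[1])
    for i in range(csr.shape[0]):
        # column indices of the stored (nonzero) entries in row i
        stored = csr.indices[csr.indptr[i]:csr.indptr[i + 1]]
        yield i, np.setdiff1d(all_cols, stored)


M = sparse.random(10, 10, density=0.1, random_state=0)
for row, zero_cols in iter_zero_positions(M):
    print(row, zero_cols)
```

Each yielded array lists the columns that are zero in that row, in ascending order, which matches the row-major order that np.where(A == 0) gives on the dense equivalent.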
