Let's say we have an initial array:
test_array = np.array([1, 4, 2, 5, 7, 4, 2, 5, 6, 7, 7, 2, 5])
What is the best way to remap the elements of this array using two other arrays: one holding the elements we want to replace, and a second holding the new values that replace them?
map_from = np.array([2, 4, 5])
map_to = np.array([9, 0, 3])
So the result should be:
remapped_array = [1, 0, 9, 3, 7, 0, 9, 3, 6, 7, 7, 9, 3]
There might be a more succinct way of doing this, but this should work, using a mask.
mask = test_array[:, None] == map_from   # (n, 3) boolean: which map_from value matches each element
val = map_to[mask.argmax(1)]             # candidate replacement per position (arbitrary where no match)
np.where(mask.any(1), val, test_array)   # keep the original value where nothing matched
Output:
array([1, 0, 9, 3, 7, 0, 9, 3, 6, 7, 7, 9, 3])
If your original array contains only non-negative integers and its maximum value is not very large, it is easiest to use a mapping array:
>>> a = np.array([1, 4, 2, 5, 7, 4, 2, 5, 6, 7, 7, 2, 5])
>>> mapping = np.arange(a.max() + 1)
>>> map_from = np.array([2, 4, 5])
>>> map_to = np.array([9, 0, 3])
>>> mapping[map_from] = map_to
>>> mapping[a]
array([1, 0, 9, 3, 7, 0, 9, 3, 6, 7, 7, 9, 3])
Here is another, more general method:
>>> vals, inv = np.unique(a, return_inverse=True)
>>> vals[np.searchsorted(vals, map_from)] = map_to
>>> vals[inv]
array([1, 0, 9, 3, 7, 0, 9, 3, 6, 7, 7, 9, 3])
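Because it goes through np.unique and np.searchsorted rather than a lookup table, this second method also works when the values are too large or too spread out to enumerate. A quick sketch with made-up values:
>>> a = np.array([10**9, 7, 10**9, 42, 7])
>>> map_from = np.array([7, 10**9])
>>> map_to = np.array([-1, -2])
>>> vals, inv = np.unique(a, return_inverse=True)
>>> vals[np.searchsorted(vals, map_from)] = map_to
>>> vals[inv]
array([-2, -1, -2, 42, -1])
Note that it assumes every value in map_from actually occurs in a; otherwise searchsorted would point the replacement at the wrong slot.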
Given some numpy array a
array([2,2,3,3,2,0,0,0,2,2,3,2,0,1,1,0])
what is the best way to get all groups of n indices such that each index in a group points to a different value in a?
Obviously there is no group larger than the number of unique elements in a, here 4.
So for example, one group of size 4 is
array([0,2,5,13])
Consider that a might be quite long, let's say up to 250k.
If the result gets too large, it might also be desirable not to compute all such groups, but only the first k requested.
For integer inputs, we can have a solution based on this post - sort once, then bincount gives the size of each group of equal values, and the cumulative sums of those sizes locate the start of each group in the sorted order -
In [41]: sidx = a.argsort() # use kind='mergesort' for first occurrences
In [42]: c = np.bincount(a)
In [43]: np.sort(sidx[np.r_[0,(c[c!=0])[:-1].cumsum()]])
Out[43]: array([ 0, 2, 5, 13])
Another one, closely related to the previous method, that works for generic inputs -
In [44]: b = a[sidx]
In [45]: np.sort(sidx[np.r_[True,b[:-1]!=b[1:]]])
Out[45]: array([ 0, 2, 5, 13])
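Since this variant only relies on sorting, it also handles non-integer data. A quick sketch with string labels (illustrative input):
In [46]: s = np.array(['b', 'b', 'c', 'a', 'b', 'a'])
In [47]: sidx = s.argsort(kind='mergesort')  # stable sort keeps first occurrences first
In [48]: b = s[sidx]
In [49]: np.sort(sidx[np.r_[True, b[:-1] != b[1:]]])
Out[49]: array([0, 2, 3])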
Another one with numba for memory efficiency, and hence performance too: it selects the first index of each unique value and supports the additional k arg -
import numpy as np
from numba import njit

@njit
def _numba1(a, notfound, out, k):
    iterID = 0
    for i, e in enumerate(a):
        if notfound[e]:
            notfound[e] = False
            out[iterID] = i
            iterID += 1
            if iterID >= k:
                break
    return out

def unique_elems(a, k, maxnum=None):
    # feed in max of the input array as maxnum value if known
    if maxnum is None:
        L = a.max() + 1
    else:
        L = maxnum + 1
    notfound = np.ones(L, dtype=bool)
    out = np.ones(k, dtype=a.dtype)
    return _numba1(a, notfound, out, k)
Sample run -
In [16]: np.random.seed(0)
...: a = np.random.randint(0,10,200)
In [17]: a
Out[17]:
array([5, 0, 3, 3, 7, 9, 3, 5, 2, 4, 7, 6, 8, 8, 1, 6, 7, 7, 8, 1, 5, 9,
8, 9, 4, 3, 0, 3, 5, 0, 2, 3, 8, 1, 3, 3, 3, 7, 0, 1, 9, 9, 0, 4,
7, 3, 2, 7, 2, 0, 0, 4, 5, 5, 6, 8, 4, 1, 4, 9, 8, 1, 1, 7, 9, 9,
3, 6, 7, 2, 0, 3, 5, 9, 4, 4, 6, 4, 4, 3, 4, 4, 8, 4, 3, 7, 5, 5,
0, 1, 5, 9, 3, 0, 5, 0, 1, 2, 4, 2, 0, 3, 2, 0, 7, 5, 9, 0, 2, 7,
2, 9, 2, 3, 3, 2, 3, 4, 1, 2, 9, 1, 4, 6, 8, 2, 3, 0, 0, 6, 0, 6,
3, 3, 8, 8, 8, 2, 3, 2, 0, 8, 8, 3, 8, 2, 8, 4, 3, 0, 4, 3, 6, 9,
8, 0, 8, 5, 9, 0, 9, 6, 5, 3, 1, 8, 0, 4, 9, 6, 5, 7, 8, 8, 9, 2,
8, 6, 6, 9, 1, 6, 8, 8, 3, 2, 3, 6, 3, 6, 5, 7, 0, 8, 4, 6, 5, 8,
2, 3])
In [19]: unique_elems(a, k=6)
Out[19]: array([0, 1, 2, 4, 5, 8])
Use numpy.unique for this job. There are several other options; one can, for instance, also return the number of times each unique item appears in a.
import numpy as np
# Sample data
a = np.array([2,2,3,3,2,0,0,0,2,2,3,2,0,1,1,0])
# The unique values are in 'u'
# The indices of the first occurrence of the unique values are in 'indices'
u, indices = np.unique(a, return_index=True)
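For example, a sketch that also pulls out the counts with return_counts (available since NumPy 1.9):
import numpy as np
a = np.array([2,2,3,3,2,0,0,0,2,2,3,2,0,1,1,0])
# Unique values, index of each one's first occurrence, and how often it appears
u, indices, counts = np.unique(a, return_index=True, return_counts=True)
print(u)        # [0 1 2 3]
print(indices)  # [5 13 0 2]
print(counts)   # [5 2 6 3]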
I have a large numpy array of size 100x100. Among these 10000 values, there are only about 50 unique values. So I want to create a second array of length 50, containing these unique values, and then somehow map the large array to the smaller array. Effectively, I want to store just 50 values in my system instead of redundant 10000 values.
Slices of arrays seem to share memory, but as soon as I use specific indexing, memory sharing is lost.
a = np.array([1,2,3,4,5])
b = a[:3]
indices = [0,1,2]
c = a[indices]
print(b,c)
print(np.shares_memory(a,b),np.shares_memory(a,c))
This gives the output:
[1 2 3] [1 2 3]
True False
Even though b and c refer to the same values of a, b (the slice) shares memory with a while c doesn't. If I execute b[0] = 100, a[0] also becomes 100, since they share memory. That is not the case with c.
I want to make c, which is a collection of values which are all from a, share memory with a.
In general it is not possible to save memory in this way. The reason is that your data consists of 64-bit integers, and pointers are also 64-bit integers, so if you try to store each value exactly once in some auxiliary array and then point at those values, you will end up using basically the same amount of space.
The answer would be different if, for example, some of your arrays were subsets of other ones, or if you were storing large types like long strings.
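A quick sketch illustrating the point (dtypes assume a 64-bit platform; the names are made up):
import numpy as np
rng = np.random.default_rng(0)
x = rng.integers(0, 50, size=(100, 100))   # ~50 unique int64 values
u, inv = np.unique(x, return_inverse=True)
print(x.nbytes)               # 80000 bytes in the original
print(u.nbytes + inv.nbytes)  # 400 + 80000: the inverse indices eat up the saving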
So make a random array with a small set of unique values:
In [45]: x = np.random.randint(0,10,(10,10))
In [46]: x
Out[46]:
array([[4, 3, 8, 5, 4, 8, 8, 1, 8, 1],
[9, 2, 7, 2, 9, 5, 3, 9, 3, 3],
[6, 2, 6, 9, 4, 2, 3, 4, 6, 7],
[1, 0, 2, 1, 0, 9, 4, 2, 6, 2],
[8, 1, 6, 8, 3, 9, 5, 0, 8, 5],
[4, 9, 1, 4, 1, 2, 8, 4, 7, 2],
[4, 5, 2, 4, 8, 0, 1, 4, 4, 7],
[2, 2, 0, 5, 3, 0, 3, 3, 3, 9],
[3, 1, 0, 6, 4, 8, 8, 3, 5, 2],
[7, 5, 9, 2, 8, 0, 8, 1, 7, 8]])
Find the unique ones:
In [48]: np.unique(x)
Out[48]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Better yet, get the unique values plus an array that lets us map those values back onto the original:
In [49]: np.unique(x, return_inverse=True)
Out[49]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([4, 3, 8, 5, 4, 8, 8, 1, 8, 1, 9, 2, 7, 2, 9, 5, 3, 9, 3, 3, 6, 2,
6, 9, 4, 2, 3, 4, 6, 7, 1, 0, 2, 1, 0, 9, 4, 2, 6, 2, 8, 1, 6, 8,
3, 9, 5, 0, 8, 5, 4, 9, 1, 4, 1, 2, 8, 4, 7, 2, 4, 5, 2, 4, 8, 0,
1, 4, 4, 7, 2, 2, 0, 5, 3, 0, 3, 3, 3, 9, 3, 1, 0, 6, 4, 8, 8, 3,
5, 2, 7, 5, 9, 2, 8, 0, 8, 1, 7, 8]))
There's a value in the reverse mapping for each element in the original.
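Indexing the unique values with that reverse mapping reconstructs the original (the reshape covers NumPy versions where the inverse comes back flattened):
In [50]: u, inv = np.unique(x, return_inverse=True)
In [51]: np.array_equal(u[inv].reshape(x.shape), x)
Out[51]: True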
I want to implement a vectorized SGD algorithm and would like to generate multiple mini batches at once.
Suppose data = np.arange(0, 100), miniBatchSize=10, n_miniBatches=10 and indices = np.random.randint(0, n_miniBatches, 5) (5 mini batches). What I would like to achieve is
miniBatches = np.zeros((5, miniBatchSize))
for i in range(5):
    miniBatches[i] = data[indices[i]: indices[i] + miniBatchSize]
Is there any way to avoid the for loop?
Thanks!
It can be done using stride tricks:
from numpy.lib.stride_tricks import as_strided

# Read-only view with a[i, j] == data[i + j], so column j is the mini batch
# starting at offset j
a = as_strided(data[:n_miniBatches], shape=(miniBatchSize, n_miniBatches),
               strides=2*data.strides, writeable=False)
miniBatches = a[:, indices].T
# E.g. indices = array([0, 7, 1, 0, 0])
Output:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[ 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
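On NumPy 1.20+ the same result can be had without spelling out the strides by hand; a sketch using sliding_window_view with the question's values:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.arange(0, 100)
miniBatchSize = 10
indices = np.array([0, 7, 1, 0, 0])

# One read-only window per possible start offset; fancy indexing then copies
# out the requested rows
windows = sliding_window_view(data, miniBatchSize)  # shape (91, 10)
miniBatches = windows[indices]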
I know that scipy.sparse.find(A) returns 3 arrays I, J, V, containing the rows, columns, and values of the nonzero elements respectively.
What I want is a way to do the same (except for the V array) for all zero elements, without having to iterate through the matrix since it's too large.
Make a small sparse matrix with 10% density:
In [1]: from scipy import sparse
In [2]: M = sparse.random(10,10,.1)
In [3]: M
Out[3]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in COOrdinate format>
The 10 nonzero values:
In [5]: sparse.find(M)
Out[5]:
(array([6, 4, 1, 2, 3, 0, 1, 6, 9, 6], dtype=int32),
array([1, 2, 3, 3, 3, 4, 4, 4, 5, 8], dtype=int32),
array([ 0.91828586, 0.29763717, 0.12771201, 0.24986069, 0.14674883,
0.56018409, 0.28643427, 0.11654358, 0.8784731 , 0.13253971]))
If, out of the 100 elements of the matrix, 10 are nonzero, then 90 elements are zero. Do you really want the indices of all of those?
where or nonzero on the dense equivalent gives the same indices:
In [6]: A = M.A # dense
In [7]: np.where(A)
Out[7]:
(array([0, 1, 1, 2, 3, 4, 6, 6, 6, 9], dtype=int32),
array([4, 3, 4, 3, 3, 2, 1, 4, 8, 5], dtype=int32))
And the indices of the 90 zero values:
In [8]: np.where(A==0)
Out[8]:
(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9], dtype=int32),
array([0, 1, 2, 3, 5, 6, 7, 8, 9, 0, 1, 2, 5, 6, 7, 8, 9, 0, 1, 2, 4, 5, 6,
7, 8, 9, 0, 1, 2, 4, 5, 6, 7, 8, 9, 0, 1, 3, 4, 5, 6, 7, 8, 9, 0, 1,
2, 3, 4, 5, 6, 7, 8, 9, 0, 2, 3, 5, 6, 7, 9, 0, 1, 2, 3, 4, 5, 6, 7,
8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 6, 7, 8, 9], dtype=int32))
That's 2 arrays of shape (90,), 180 integers, as opposed to the 100 values in the dense array itself. If your sparse matrix is too large to convert to dense, it will be too large to produce all the zero indices (assuming reasonable sparsity).
print(M) shows the same triplets as find. The attributes of the coo format also give the nonzero indices:
In [13]: M.row
Out[13]: array([6, 6, 3, 4, 1, 6, 9, 2, 1, 0], dtype=int32)
In [14]: M.col
Out[14]: array([1, 4, 3, 2, 3, 8, 5, 3, 4, 4], dtype=int32)
(Sometimes manipulation of a matrix can set values to 0 without removing them from the attributes. So find/nonzero takes an added step to remove those, if any.)
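A sketch of that situation: overwrite a stored entry with 0, and nnz (stored values) disagrees with find/nonzero (actual nonzeros):
In [16]: M2 = M.tocsr()
In [17]: r, c = M2.nonzero()
In [18]: M2[r[0], c[0]] = 0      # explicit zero: it stays in the data attribute
In [19]: M2.nnz
Out[19]: 10
In [20]: len(sparse.find(M2)[0])
Out[20]: 9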
We could apply find to M==0 as well - but sparse will give us a warning.
In [15]: sparse.find(M==0)
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:213: SparseEfficiencyWarning: Comparing a sparse matrix with 0 using == is inefficient, try using != instead.
", try using != instead.", SparseEfficiencyWarning)
It's the same thing that I've been warning about - the large size of this set. The resulting arrays are the same as in Out[8].
Assuming you have a scipy sparse array and have imported find:
from itertools import product
from scipy.sparse import find

I, J, _ = find(your_sparse_array)
nonzero = set(zip(I, J))  # a set, so membership tests are O(1) and repeatable
nrows, ncols = your_sparse_array.shape
for a, b in product(range(nrows), range(ncols)):
    if (a, b) not in nonzero:
        print(a, b)
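For anything but a tiny matrix, though, the Python-level loop will crawl. A vectorized sketch of the same idea, reusing I and J from above (it assumes the shape fits a dense boolean mask):
import numpy as np
mask = np.zeros(your_sparse_array.shape, dtype=bool)
mask[I, J] = True                         # mark the nonzero positions
zero_rows, zero_cols = np.nonzero(~mask)  # indices of all zero elements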