Choosing a random 1 from each row of a binary numpy matrix - python

Suppose I have a binary matrix. I would like to cast that matrix into another matrix where each row has a single one, and the index of that one is random for each row.
For instance, if one of the rows is [0,1,0,1,0,0,1], I would cast it to [0,0,0,1,0,0,0], where the index of the surviving 1 is selected randomly.
How could I do it in numpy?
Presently I find the max of each row (since the max function returns a single index) and set that entry to zero. It works when every row has at most 2 ones, but it fails when a row has more.

Extending @zhangxaochen's answer, given a random binary array
import numpy as np

x = np.random.randint(0, 2, (8, 8))  # random_integers is deprecated; randint's high is exclusive
you can populate another array with a randomly drawn 1 from each row of x (assuming no row is all zeros):
y = np.zeros_like(x)
ind = [np.random.choice(np.where(row)[0]) for row in x]  # one random nonzero index per row
y[np.arange(x.shape[0]), ind] = 1
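A loop-free variant of the same idea is a sketch borrowing the random-score trick from the tie-breaking answer further down: give each 1 a random score and take the row-wise argmax (note that an all-zero row would arbitrarily get its 1 placed at column 0):
import numpy as np

x = np.random.randint(0, 2, (8, 8))             # random binary matrix
ind = (x * np.random.rand(*x.shape)).argmax(1)  # random index among each row's ones
y = np.zeros_like(x)
y[np.arange(x.shape[0]), ind] = 1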

I'd like to use np.where to get the indices of the non-zero elements:
In [351]: a=np.array([0,1,0,1,0,0,1])
In [352]: from random import choice
...: idx=choice(np.where(a)[0]) #or np.nonzero instead of np.where
In [353]: b=np.zeros_like(a)
...: b[idx]=1
...: b
Out[353]: array([0, 1, 0, 0, 0, 0, 0])

Related

Get column indices of row-wise maximum values of a 2D array (with random tie-breaking)

Given a 2D numpy array, I want to construct an array out of the column indices of the maximum value of each row. So far, arr.argmax(1) works well. However, for my specific case, for some rows, 2 or more columns may contain the maximum value. In that case, I want to select a column index randomly (not the first index, as is the case with .argmax(1)).
For example, for the following arr:
arr = np.array([
    [0, 1, 0],
    [1, 1, 0],
    [2, 1, 3],
    [3, 2, 2]
])
there can be two possible outcomes: array([1, 0, 2, 0]) and array([1, 1, 2, 0]) each chosen with 1/2 probability.
I have code that returns the expected output using a list comprehension:
idx = np.arange(arr.shape[1])
ans = [np.random.choice(idx[ix]) for ix in arr == arr.max(1, keepdims=True)]
but I'm looking for an optimized numpy solution. In other words, how do I replace the list comprehension with numpy methods to make the code feasible for bigger arrays?
Use scipy.stats.rankdata and apply_along_axis as follows.
import numpy as np
from scipy.stats import rankdata
ranks = rankdata(-arr, axis = 1, method = "min")
func = lambda x: np.random.choice(np.where(x==1)[0])
idx = np.apply_along_axis(func, 1, ranks)
print(idx)
It returns [1 0 2 0] or [1 1 2 0].
The main idea is that rankdata calculates the rank of every value within its row, so (because of the negation) each row's maximum values receive rank 1. func randomly chooses one of the indices whose rank is 1. Finally, apply_along_axis applies func to every row of ranks.
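For the sample arr from the question, the ranks come out as below (rankdata returns floats; the row-wise maxima always receive rank 1, ties included):
print(ranks)
# [[2. 1. 2.]
#  [1. 1. 3.]
#  [2. 3. 1.]
#  [1. 2. 2.]]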
After some advice I got offline, it turns out that randomized selection among the maximum values is possible when we multiply the boolean array that flags the row-wise maxima by a random array of the same shape. Then what remains is a simple argmax(1) call.
# boolean array that flags maximum values of each row
mxs = arr == arr.max(1, keepdims=True)
# random array where non-maximum values are zero and maximum values are random values
random_arr = np.random.rand(*arr.shape) * mxs
# row-wise maximum of the auxiliary array
ans = random_arr.argmax(1)
A timeit test shows that for data of shape (507_563, 12), this code runs in ~172 ms on my machine while the loop in the question runs for 11 sec, so this is about 63x faster.
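For reference, a minimal sketch of such a timing; the shape mirrors the one quoted, the data here is just random integers, and absolute numbers will vary by machine:
import timeit
import numpy as np

arr = np.random.randint(0, 4, (507_563, 12))

def random_argmax():
    mxs = arr == arr.max(1, keepdims=True)              # flag row-wise maxima
    return (np.random.rand(*arr.shape) * mxs).argmax(1)

print(timeit.timeit(random_argmax, number=10) / 10, "seconds per run")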

Counting occurrences of elements of one array in another array

I want to find the frequency of the elements of one one-dimensional numpy array (arr1) in another one-dimensional numpy array (arr2). The array arr1 contains elements with no repetitions. Also, all elements in arr1 are part of the array of unique elements of arr2.
Consider this as an example,
arr1 = np.array([1,2,6])
arr2 = np.array([2, 3, 6, 1, 2, 1, 2, 0, 2, 0])
At present, I am using the following:
freq = np.zeros(len(arr1), dtype=int)
for i in range(len(arr1)):
    mark = np.where(arr2 == arr1[i])
    freq[i] = len(mark[0])
print(freq)
>> [2 4 1]
The aforementioned method gives me the correct answer. But, I want to know if there is a better/more efficient method than the one that I am following.
Here's a vectorized solution based on np.searchsorted -
idx = np.searchsorted(arr1,arr2)
idx[idx==len(arr1)] = 0
mask = arr1[idx]==arr2
out = np.bincount(idx[mask])
It assumes arr1 is sorted. If it isn't, we have two options:
Sort arr1 as a pre-processing step. Since arr1 holds a subset of the unique elements of arr2, it should be a comparatively small array, making the sort inexpensive.
Use the sorter arg with searchsorted to compute idx:
sidx = arr1.argsort()
idx = sidx[np.searchsorted(arr1, arr2, sorter=sidx)]
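Putting option 2 together end to end, a sketch where the extra clipping line and the minlength argument guard against arr2 values that fall outside arr1, and the counts come out in arr1's original order:
import numpy as np

arr1 = np.array([6, 1, 2])                        # deliberately unsorted
arr2 = np.array([2, 3, 6, 1, 2, 1, 2, 0, 2, 0])

sidx = arr1.argsort()
pos = np.searchsorted(arr1, arr2, sorter=sidx)
pos[pos == len(arr1)] = 0                         # clip out-of-range hits before indexing
idx = sidx[pos]
mask = arr1[idx] == arr2                          # keep exact matches only
out = np.bincount(idx[mask], minlength=len(arr1))
print(out)                                        # [1 2 4] -> counts of 6, 1, 2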

Python: Plot an array of strings with repeated entries vs float without for loop

Hi, I am trying to plot a numpy array of strings on the y axis, for example
arr = np.array(['a','a','bas','dgg','a']) #The actual strings are about 11 characters long
vs a float array of equal length. The string array I am working with is very large, ~100 million entries. One of the solutions I had in mind was to convert the string array to unique integer ids, for example,
vocab = np.unique(arr)
vocab = list(vocab)
arrId = np.zeros(len(arr))
for i in range(len(arr)):
    arrId[i] = vocab.index(arr[i])
and then call matplotlib.pyplot.plot(arrId). But I cannot afford to run a for loop to convert the array of strings to an array of unique integer ids. In an initial search I could not find a way to map strings to unique ids without using a loop. Maybe I am missing something, but is there a smart way to do this in Python?
EDIT -
Thanks. The solutions provided use vocab, ind = np.unique(arr, return_inverse=True), where ind is the returned integer id array. But it seems np.unique is O(N*log(N)) according to this (numpy.unique with order preserved), while pandas.unique is O(N). I am not sure, however, how to get ind from pandas.unique. Plotting the data, I guess, can be done in O(N). So I was wondering: is there a way to do this in O(N), perhaps by hashing of some sort?
numpy.unique used with the return_inverse argument allows you to obtain the inverse index.
arr = np.array(['a','a','bas','dgg','a'])
unique, rev = np.unique(arr, return_inverse=True)
#unique: ['a' 'bas' 'dgg']
#rev: [0 0 1 2 0]
such that unique[rev] returns the original array ['a' 'a' 'bas' 'dgg' 'a'].
This can be easily used to plot the data.
import numpy as np
import matplotlib.pyplot as plt
arr = np.array(['a','a','bas','dgg','a'])
x = np.array([1,2,3,4,5])
unique, rev = np.unique(arr, return_inverse=True)
print(unique)
print(rev)
print(unique[rev])
fig, ax = plt.subplots()
ax.scatter(x, rev)
ax.set_yticks(range(len(unique)))
ax.set_yticklabels(unique)
plt.show()
You can factorize your strings:
In [75]: arr = np.array(['a','a','bas','dgg','a'])
In [76]: cats, idx = np.unique(arr, return_inverse=True)
In [77]: plt.plot(idx)
Out[77]: [<matplotlib.lines.Line2D at 0xf82da58>]
In [78]: cats
Out[78]:
array(['a', 'bas', 'dgg'],
dtype='<U3')
In [79]: idx
Out[79]: array([0, 0, 1, 2, 0], dtype=int64)
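Addressing the O(N) question in the edit: if pandas is available, pd.factorize is the usual hash-based route and returns the integer ids directly, in first-seen rather than sorted order (a sketch):
import numpy as np
import pandas as pd

arr = np.array(['a', 'a', 'bas', 'dgg', 'a'])
ind, vocab = pd.factorize(arr)  # hash-based, O(N); no sorting involved
# ind:   [0 0 1 2 0]
# vocab: ['a' 'bas' 'dgg']  (first-seen order)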
You can use the numpy unique function to return the array of unique values:
print(np.unique(arr))
['a' 'bas' 'dgg']
collections.Counter also returns the values together with their counts:
import collections
print(collections.Counter(arr))
Counter({'a': 3, 'bas': 1, 'dgg': 1})
Does this help at all?

How to change values of a numpy array between two positions

I have a numpy array full of 0's such as this
[0,0,0,0,0,0,0,0,0,0,0,0]
And a list of positions such as this
[2-4,6-10]
So what I want to do is iterate through the list of positions and change the 0's in the numpy array to 1's at the corresponding positions, so that I end up with a numpy array such as
[0,1,1,1,0,1,1,1,1,1,0,0]
Hope this is clear enough, if not just let me know.
Thanks.
Here's one approach by generating those indices as a concatenated array with np.r_ and then indexing and assigning 1s -
In [64]: a = np.array([0,0,0,0,0,0,0,0,0,0,0,0])
In [65]: pos = np.r_[1:4,5:10]
In [66]: a[pos] = 1
In [67]: a
Out[67]: array([0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0])
You could use a list of pairs to hold the positions:
l = [(2, 4), (6, 10)]
nl = [0,0,0,0,0,0,0,0,0,0,0,0]
Then loop over them:
for a in l:
    for c, n in enumerate(nl):
        if a[0] <= c < a[1]:
            nl[c] = 1
This is by far not the fastest way to do this but it is simple and readable.
As suggested by another user, you could use slice assignment instead, which is a lot better in my opinion:
nl[a[0]:a[1]] = [1] * (a[1] - a[0])
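Putting the slice idea together for all pairs, a sketch assuming the positions are 0-indexed with an exclusive end, matching Python slice semantics:
import numpy as np

nl = np.zeros(12, dtype=int)
for start, end in [(2, 4), (6, 10)]:
    nl[start:end] = 1        # fills the whole range at once
print(nl)                    # [0 0 1 1 0 0 1 1 1 1 0 0]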

index of a random choice from numpy array

I have a 2-d numpy array populated with integers [-1, 0, +1]. I need to choose a random element that is not zero from it and calculate the sum of its adjacent elements.
Is there a way to get the index of a numpy.random.choice?
import numpy as np

lattice = np.zeros(9, dtype=int)   # np.int is removed in modern numpy
lattice[:2] = -1
lattice[2:4] = 1
np.random.shuffle(lattice)
lattice = lattice.reshape((3, 3))
np.random.choice(lattice[lattice != 0])
This gives a draw from the right sample, but I need the index of the chosen element to be able to identify its adjacent elements. My other idea was to sample from the indices and then check whether the element is non-zero, but this is obviously quite wasteful when there are a lot of zeros.
You can use lattice.nonzero() to get the locations of the nonzero elements [nonzero docs]:
>>> lnz = lattice.nonzero()
>>> lnz
(array([0, 0, 1, 1]), array([1, 2, 0, 1]))
which returns a tuple of arrays corresponding to the coordinates of the nonzero elements. Then you draw an index:
>>> np.random.randint(0, len(lnz[0]))
3
and use that to decide which coordinate you're interested in.
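Tying it together, here is a sketch that picks the random nonzero element and sums its neighbours, assuming "adjacent" means the four orthogonal neighbours and that coordinates outside the grid are skipped (lattice as built in the question):
rows, cols = lattice.nonzero()            # coordinates of all nonzero entries
i = np.random.randint(len(rows))          # index of a random nonzero element
r, c = rows[i], cols[i]

# sum the orthogonal neighbours, skipping positions that fall off the grid
total = sum(lattice[r + dr, c + dc]
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= r + dr < lattice.shape[0] and 0 <= c + dc < lattice.shape[1])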
