I have a 2-d numpy array populated with integers [-1, 0, +1]. I need to choose a random element that is not zero from it and calculate the sum of its adjacent elements.
Is there a way to get the index of a numpy.random.choice?
import numpy as np
lattice=np.zeros(9,dtype=int)
lattice[:2]=-1
lattice[2:4]=1
np.random.shuffle(lattice)
lattice=lattice.reshape((3,3))
np.random.choice(lattice[lattice!=0])
This gives the draw from the right sample, but I would need the index of the choice to be able to identify its adjacent elements. My other idea is to just sample from the index and then check if the element is non-zero, but this is obviously quite wasteful when there are a lot of zeros.
You can use lattice.nonzero() to get the locations of the nonzero elements (see the numpy.nonzero documentation):
>>> lnz = lattice.nonzero()
>>> lnz
(array([0, 0, 1, 1]), array([1, 2, 0, 1]))
which returns a tuple of arrays corresponding to the coordinates of the nonzero elements. Then you draw an index:
>>> np.random.randint(0, len(lnz[0]))
3
and use that to decide which coordinate you're interested in.
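Putting those steps together, here is a minimal sketch of the whole task. The question does not say what "adjacent" means, so this assumes the four up/down/left/right neighbours and non-periodic edges; the helper name neighbor_sum is made up for illustration.

```python
import numpy as np

def neighbor_sum(lattice, i, j):
    """Sum of the up/down/left/right neighbours of lattice[i, j].

    Assumption: edges are non-periodic, so neighbours that would fall
    outside the array are simply skipped.
    """
    total = 0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < lattice.shape[0] and 0 <= nj < lattice.shape[1]:
            total += lattice[ni, nj]
    return total

# Same nonzero layout as the lnz output shown above
lattice = np.array([[0, -1, 1],
                    [1, -1, 0],
                    [0,  0, 0]])

rows, cols = lattice.nonzero()         # coordinates of the nonzero elements
k = np.random.randint(0, len(rows))    # position within the nonzero list
i, j = rows[k], cols[k]                # coordinates of the chosen element
print((i, j), lattice[i, j], neighbor_sum(lattice, i, j))
```

For a periodic lattice one would wrap ni and nj with the modulo operator instead of skipping them.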
Given a 2D numpy array, I want to construct an array out of the column indices of the maximum value of each row. So far, arr.argmax(1) works well. However, for my specific case, for some rows, 2 or more columns may contain the maximum value. In that case, I want to select a column index randomly (not the first index as it is the case with .argmax(1)).
For example, for the following arr:
arr = np.array([
[0, 1, 0],
[1, 1, 0],
[2, 1, 3],
[3, 2, 2]
])
there can be two possible outcomes: array([1, 0, 2, 0]) and array([1, 1, 2, 0]) each chosen with 1/2 probability.
I have code that returns the expected output using a list comprehension:
idx = np.arange(arr.shape[1])
ans = [np.random.choice(idx[ix]) for ix in arr == arr.max(1, keepdims=True)]
but I'm looking for an optimized numpy solution. In other words, how do I replace the list comprehension with numpy methods to make the code feasible for bigger arrays?
Use scipy.stats.rankdata and apply_along_axis as follows.
import numpy as np
from scipy.stats import rankdata
ranks = rankdata(-arr, axis = 1, method = "min")
func = lambda x: np.random.choice(np.where(x==1)[0])
idx = np.apply_along_axis(func, 1, ranks)
print(idx)
It returns [1 0 2 0] or [1 1 2 0].
The main idea is that rankdata calculates the rank of every value in each row (on the negated array), so the maximum value gets rank 1. func randomly chooses one of the indices whose rank is 1. Finally, apply_along_axis applies func to every row of ranks.
After some advice I got offline, it turns out that randomizing among the maximum values is possible if we multiply the boolean array that flags the row-wise maximum values by a random array of the same shape. Then all that remains is a simple argmax(1) call.
# boolean array that flags maximum values of each row
mxs = arr == arr.max(1, keepdims=True)
# random array where non-maximum values are zero and maximum values are random values
random_arr = np.random.rand(*arr.shape) * mxs
# row-wise maximum of the auxiliary array
ans = random_arr.argmax(1)
A timeit test shows that for data of shape (507_563, 12), this code runs in ~172 ms on my machine while the loop in the question runs for 11 sec, so this is about 63x faster.
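As a quick sanity check of the masked-random trick, the sketch below draws many times on the example array from the question and confirms that every pick is a row maximum. The wrapper name random_argmax is hypothetical and exists only for this illustration.

```python
import numpy as np

arr = np.array([[0, 1, 0],
                [1, 1, 0],
                [2, 1, 3],
                [3, 2, 2]])

def random_argmax(a):
    """Row-wise argmax with ties broken at random (masked-random trick)."""
    mxs = a == a.max(1, keepdims=True)
    return (np.random.rand(*a.shape) * mxs).argmax(1)

# Repeat the draw and look up the value each chosen column points at.
draws = np.array([random_argmax(arr) for _ in range(1000)])   # shape (1000, 4)
picked = arr[np.arange(arr.shape[0]), draws]
print((picked == arr.max(1)).all())

# Row 1 is tied between columns 0 and 1; print how often each was chosen.
print(np.bincount(draws[:, 1], minlength=arr.shape[1]) / len(draws))
```

np.random.rand draws from [0, 1), so an exact 0.0 landing on a maximum is possible in principle, though vanishingly unlikely.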
Let's say I have the following array :
array([2, 0, 0, 1, 0, 1, 0, 0])
How do I get the indices where I have occurrence of sequence of values : [0,0]? So, the expected output for such a case would be : [1,2,6,7].
Edit :
1) Please note that [0,0] is just a sequence. It could be [0,0,0] or [4,6,8,9] or [5,2,0], just anything.
2) If my array were modified to : array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0]), the expected result with the same sequence of [0,0] would be [1,2,3,4,8,9].
I am looking for some NumPy shortcut.
Well, this is basically a template-matching problem that comes up in image-processing a lot. Listed in this post are two approaches: Pure NumPy based and OpenCV (cv2) based.
Approach #1: With NumPy, one can create a 2D array of sliding indices across the entire length of the input array. Thus, each row would be a sliding window of elements. Next, match up each row with the input sequence, which will bring in broadcasting for a vectorized solution. We look for all True rows indicating those are the ones that are the perfect matches and as such would be the starting indices of the matches. Finally, using those indices, create a range of indices extending up to the length of the sequence, to give us the desired output. The implementation would be -
def search_sequence_numpy(arr,seq):
    """ Find sequence in an array using NumPy only.

    Parameters
    ----------
    arr    : input 1D array
    seq    : input 1D array

    Output
    ------
    Output : 1D Array of indices in the input array that satisfy the
    matching of input sequence in the input array.
    In case of no match, an empty list is returned.
    """

    # Store sizes of input array and sequence
    Na, Nseq = arr.size, seq.size

    # Range of sequence
    r_seq = np.arange(Nseq)

    # Create a 2D array of sliding indices across the entire length of input array.
    # Match up with the input sequence & get the matching starting indices.
    M = (arr[np.arange(Na-Nseq+1)[:,None] + r_seq] == seq).all(1)

    # Get the range of those indices as final output
    if M.any() >0:
        return np.where(np.convolve(M,np.ones((Nseq),dtype=int))>0)[0]
    else:
        return []         # No match found
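To make the sliding-index idea concrete, here is a small step-by-step sketch on the array from the question. It is only an illustration: the intermediate names are made up, and the final step expands the start indices with np.unique rather than np.convolve.

```python
import numpy as np

arr = np.array([2, 0, 0, 1, 0, 1, 0, 0])
seq = np.array([0, 0])

# Each row holds the indices of one window of length seq.size.
window_idx = np.arange(arr.size - seq.size + 1)[:, None] + np.arange(seq.size)
windows = arr[window_idx]                 # shape (7, 2)

# A window matches when all of its elements equal seq.
starts = np.flatnonzero((windows == seq).all(1))
print(starts.tolist())                    # [1, 6]

# Expand each start into the full run of indices it covers.
covered = np.unique(starts[:, None] + np.arange(seq.size))
print(covered.tolist())                   # [1, 2, 6, 7]
```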
Approach #2: With OpenCV (cv2), we have a built-in function for template-matching: cv2.matchTemplate. Using this, we get the starting indices of the matches. The rest of the steps are the same as in the previous approach. Here's the implementation with cv2:
import cv2
import numpy as np

def search_sequence_cv2(arr,seq):
    """ Find sequence in an array using cv2.
    """

    # Run a template match with input sequence as the template across
    # the entire length of the input array and get scores.
    S = cv2.matchTemplate(arr.astype('uint8'),seq.astype('uint8'),cv2.TM_SQDIFF)

    # Now, with floating point array cases, the matching scores might not be
    # exactly zeros, but would be very small numbers as compared to others.
    # So, use a very small value as a threshold for the scores
    # and decide on matches against it.
    thresh = 1e-5 # Would depend on elements in seq. So, be careful setting this.

    # Find the matching indices
    idx = np.where(S.ravel() < thresh)[0]

    # Get the range of those indices as final output
    if len(idx)>0:
        return np.unique((idx[:,None] + np.arange(seq.size)).ravel())
    else:
        return []         # No match found
Sample run
In [512]: arr = np.array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0])
In [513]: seq = np.array([0,0])
In [514]: search_sequence_numpy(arr,seq)
Out[514]: array([1, 2, 3, 4, 8, 9])
In [515]: search_sequence_cv2(arr,seq)
Out[515]: array([1, 2, 3, 4, 8, 9])
Runtime test
In [477]: arr = np.random.randint(0,9,(100000))
...: seq = np.array([3,6,8,4])
...:
In [478]: np.allclose(search_sequence_numpy(arr,seq),search_sequence_cv2(arr,seq))
Out[478]: True
In [479]: %timeit search_sequence_numpy(arr,seq)
100 loops, best of 3: 11.8 ms per loop
In [480]: %timeit search_sequence_cv2(arr,seq)
10 loops, best of 3: 20.6 ms per loop
Seems like the Pure NumPy based one is the safest and fastest!
I find that the most succinct, intuitive and general way to do this is using regular expressions.
import re
import numpy as np
# Set the threshold for string printing to infinite
np.set_printoptions(threshold=np.inf)
# Remove spaces and linebreaks that would come through when printing your vector.
# Note: string positions equal array indices only if every element prints as one character.
yourarray_string = re.sub(r'\n|\s','',np.array_str( yourarray ))[1:-1]
# The next line is the most important, set the arguments in the braces
# such that the first argument is the shortest sequence you want
# and the second argument is the longest (using empty as infinite length)
r = re.compile(r"[0]{1,}")
zero_starts = [m.start() for m in r.finditer( yourarray_string )]
zero_ends = [m.end() for m in r.finditer( yourarray_string )]
I am trying to figure out how np.partition function works.
For example, consider
arr = np.array([5, 4, 1, 0, -1, -3, -4, 0])
If I call np.partition(arr, kth=2), I get
np.array([-4, -3, -1, 0, 1, 4, 5, 0])
I expect that, after partition, the array will split into elements less than one, one, and elements greater than one.
But the second zero is placed on the last array position, which isn't its right place after partition.
The documentation says:
Creates a copy of the array with its elements rearranged in such a way that
the value of the element in kth position is in the position it would be in
a sorted array. All elements smaller than the kth element are moved before
this element and all equal or greater are moved behind it. The ordering of
the elements in the two partitions is undefined.
In the example you give, you have selected the element at index 2 of the sorted list (counting from zero), which is -1, and it is in the position it would occupy if the array were sorted.
The docs talk of 'a sorted array'.
np.partition behaves as if it first sorted the array provided, although it uses a selection algorithm (introselect by default) rather than a full sort. In this case the original array is:
arr = [ 5, 4, 1, 0, -1, -3, -4, 0]
When sorted, we have:
arr_sorted = [-4, -3, -1, 0, 0, 1, 4, 5]
Hence the call np.partition(arr, kth=2) takes kth to mean the element at position 2 of arr_sorted, not of arr. That element is correctly picked as -1.
When I first read the official documentation of numpy.partition, I interpreted it the same way the OP did. So I was confused when I read the examples given in the documentation, but could not figure out where my understanding was wrong. I googled it and ended up here.
Since this confusion is common, the documentation should be revised. I suggest the following:
Creates a copy of the array with its elements rearranged in such a way that:
the k-th element of the new array is in the position it would be in a sorted array. All elements smaller than the k-th element are moved before this element and all greater are moved behind it. The ordering of the elements in the two partitions is undefined. If there are other elements that are equal to the k-th element, these elements may appear before or behind the k-th element.
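The guarantees described above can be checked directly on the array from the question. This is a small sketch that tests only the documented properties, since the order inside each partition is undefined and may differ between NumPy versions.

```python
import numpy as np

arr = np.array([5, 4, 1, 0, -1, -3, -4, 0])
k = 2
part = np.partition(arr, kth=k)

# The element at position k is the one a full sort would put there.
print(part[k], np.sort(arr)[k])          # -1 -1

# Everything before position k is <= part[k]; everything after is >= part[k].
print((part[:k] <= part[k]).all(), (part[k + 1:] >= part[k]).all())   # True True
```

Note that kth refers to a position in the sorted order, not to the value arr[kth] of the original array.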
Suppose that I have a numpy.array of values, say
values = np.array([0, 3, 2, 4, 6])
and a numpy.array of indices, say
idces = np.array([1, 3, 5]).
I want to obtain an array which has a given value, say -1, in the positions of the idces, and the other elements distributed in the remaining locations. So in the case above I want to obtain
np.array([0, -1, 3, -1, 2, -1, 4, 6]).
This looks like the task of np.insert, except that the latter inserts values before the values at the specified indexes, rather than at the specified indexes (and the two coincide only when there is only one index).
So the best I could come up with is
np.insert(values, idces - np.arange(len(idces)), -1).
This is still better than creating an array with -np.ones, calculating the indices of the idces and then using np.put... but I was wondering: is there any cleaner way?
Insertion is best thought of in terms of offsets, which enumerate not the array elements but the gaps between (or before/after) them.
The documentation of np.insert describes its second argument as "the index or indices before which values is inserted", which is only approximately right. An offset can be equal to len(arr) (end of array) even though arr[len(arr)] raises an IndexError.
For example, np.insert([3, 1, 4, 1, 5], [1, 3, 3, 5], [0, 0, 0, 0]) means: put one zero at the gap numbered 1, two others at the gap numbered 3, and the last one at the end. The result is [3, 0, 1, 4, 0, 0, 1, 5, 0].
Some advantages of this enumeration over specifying post-insertion indices of new elements:
1) It's easier to insert a bunch of elements at one place: np.insert(arr, [3]*values.size, values) inserts the array values at the 3rd offset.
2) It's easier to interlace two arrays, with np.insert(arr, np.arange(values.size), values)
3) It's easier to control whether an insertion point is valid; the validity does not depend on how many elements are being inserted.
The case when you know post-insertion indices idces is easy enough to handle, as you did with
np.insert(values, idces - np.arange(len(idces)), -1)
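The conversion from post-insertion indices to offsets can be seen step by step in the sketch below. It assumes idces is sorted in increasing order, as in the question; the variable name offsets is introduced here for illustration.

```python
import numpy as np

values = np.array([0, 3, 2, 4, 6])
idces = np.array([1, 3, 5])           # desired positions in the *result*

# Each earlier insertion shifts later positions by one, so subtracting
# 0, 1, 2, ... turns post-insertion indices into offsets into `values`.
offsets = idces - np.arange(len(idces))
print(offsets.tolist())                              # [1, 2, 3]
print(np.insert(values, offsets, -1).tolist())       # [0, -1, 3, -1, 2, -1, 4, 6]

# An offset equal to len(arr) is valid and appends at the end.
print(np.insert([3, 1, 4, 1, 5], [1, 3, 3, 5], 0).tolist())   # [3, 0, 1, 4, 0, 0, 1, 5, 0]
```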
Related issue on NumPy tracker.
Suppose I have a binary matrix. I would like to cast that matrix into another matrix where each row has single one and the index of that one would be random for each row.
For instance, if one of the rows is [0,1,0,1,0,0,1], I want to cast it to something like [0,0,0,1,0,0,0], where the index of the remaining 1 is selected randomly from the original ones.
How could I do it in numpy?
At present I take the argmax of each row (since it returns a single index) and set that element to zero. This works if every row has at most two ones, but it fails when a row has more than two.
Extending @zhangxaochen's answer, given a random binary array
x = np.random.randint(0, 2, (8, 8))
you can populate another array with a randomly drawn 1 from x:
y = np.zeros_like(x)
ind = [np.random.choice(np.where(row)[0]) for row in x]
y[range(x.shape[0]), ind] = 1
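If the list comprehension becomes a bottleneck, one loop-free alternative is to mask an array of random scores with x and take the row-wise argmax. This is a sketch of a different technique, not the method above, and it assumes every row of x contains at least one 1 (an all-zero row would wrongly get a 1 in column 0).

```python
import numpy as np

x = np.array([[0, 1, 0, 1, 0, 0, 1],
              [1, 0, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 0, 0, 0]])

# Random scores, zeroed wherever x is 0; the argmax of each row is then
# a uniformly random position among that row's ones.
scores = np.random.rand(*x.shape) * (x != 0)
ind = scores.argmax(1)

y = np.zeros_like(x)
y[np.arange(x.shape[0]), ind] = 1
print(y)
```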
I'd like to use np.where to get the indices of the non-zero elements:
In [351]: a=np.array([0,1,0,1,0,0,1])
In [352]: from random import choice
...: idx=choice(np.where(a)[0]) #or np.nonzero instead of np.where
In [353]: b=np.zeros_like(a)
...: b[idx]=1
...: b
Out[353]: array([0, 1, 0, 0, 0, 0, 0])