Selecting randomly from two arrays based upon condition in Python

Suppose I have two arrays of equal length:
a = [0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1]
b = [0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0]
Now I want to pick elements from these two arrays, in the sequence given, so that they form a new array of the same length as a and b, selecting at random between a and b such that for every 1 value picked from a, there are 4.68 values picked from b in the resulting array.
So the resulting array could be something like:
res = [0,1,1,0,1, 1(from a), 0(from a), 1,1,0,0,1,1,0, 0(from a), 0,0]
This res array has: the first 5 values from b, the 6th and 7th from a, the 8th-14th from b, the 15th from a, and the 16th-17th from b.
The overall ratio of b values to a values in this example res is 4.67 (3 from a, 14 from b).
So values have to be chosen at random between the two arrays, but the sequence must be maintained, i.e. I cannot take the 7th value from one array and the 3rd from the other. If the 3rd position of the resulting array is being populated, the choice is between the 3rd element of each input array, at random. The overall ratio needs to be maintained as well.
Can you please help me develop an efficient, Pythonic way of reaching this result? The result need not be the same on every run.

Borrowing the a_count calculation from Barmar's answer (because it seems to work and I can't be bothered to reinvent it), this solution preserves the ordering of the values chosen from a and b:
from future_builtins import zip # Only on Python 2, to avoid temporary list of tuples
import random
# int() unnecessary on Python 3
a_count = int(round(1/(1 + 4.68) * len(a)))
# Use range on Python 3, xrange on Python 2, to avoid making actual list
a_indices = frozenset(random.sample(xrange(len(a)), a_count))
res = [aval if i in a_indices else bval for i, (aval, bval) in enumerate(zip(a, b))]
The basic idea here is that you determine how many a values you need, get a unique sample of the possible indices of that size, then iterate a and b in parallel, keeping the a value for the selected indices, and the b value for all others.
If you don't like the complexity of the list comprehension, you could use a different approach, copying b, then filling in the a values one by one:
res = b[:] # Copy b in its entirety
# Replace selected indices with a values
# No need to convert to frozenset for efficiency here, and it's clean
# enough to just iterate the sample directly without storing it
for i in random.sample(xrange(len(a)), a_count):
    res[i] = a[i]
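For concreteness, here is the whole approach run end to end on Python 3 (a sketch using the question's data; a_count works out to 3 for these 17 elements):

import random

a = [0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1]
b = [0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0]

a_count = round(1 / (1 + 4.68) * len(a))  # -> 3 positions take their value from a
a_indices = frozenset(random.sample(range(len(a)), a_count))
res = [aval if i in a_indices else bval for i, (aval, bval) in enumerate(zip(a, b))]
print(res)  # same length as a and b; a different random pick each run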

I believe this should work. You specify how many values you want from a (you can simply use your ratio to figure out that number), you randomly generate a 'mask' of numbers, and you choose from a or b based on a cutoff (notice that you only sort to figure out the cutoff; you use the unsorted mask later).
import numpy as np
a = [0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1]
b = [0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0]
mask = np.random.random(len(a))
from_a = 3
cutoff = np.sort(mask)[from_a]
res = []
for i in range(len(a)):
    if mask[i] < cutoff:  # exactly from_a of the mask values fall below the cutoff
        res.append(a[i])
    else:
        res.append(b[i])
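As an aside, the whole loop can be collapsed into a single vectorized selection (my rewording of the same idea, not part of the original answer):

# Take a's value where the mask falls below the cutoff, b's elsewhere.
# Note this returns an ndarray rather than a list.
res = np.where(mask < cutoff, a, b)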

Related

How to extract subarrays from an array based on threshold values in python?

I have a numpy array of the form:
a = numpy.array([0,2,2,3,4,2,5,5,6,2,5,6,4,4,2,3,1,7,7,2,3,3,4,1,8,9,8,8])
threshold = 4
threshold_seq_len = 5
subarray_seq_len = 4
the output that I am looking to achieve is
b =[array([5,5,6,2,5,6]), array([8,9,8,8])]
I would like to extract subarrays based on the criteria:
1) the subarrays should be split based on a sequence of values that are below or equal to threshold. In the above case, the first subarray ([5,5,6,2,5,6]) occurs after the sequence [0,2,2,3,4,2], all of which are below or equal to the threshold value of 4.
2) the threshold sequences should be at least as long as threshold_seq_len, otherwise they are just part of the subarray. Notice that the value '2' remains inside the first subarray because it is a singular occurrence (length = 1).
3) the subarrays themselves should be at least as long as subarray_seq_len. For example, the values at indices 17 and 18 are both 7, but they are not considered since their length < 4.
For context, the arrays represent amplitudes in an audio file, and I am trying to extract viable non-silence candidates based on the described logic.
What is a pythonic way of achieving this efficiently?
I have tried the approaches described in Extract subarrays of numpy array whose values are above a threshold.
The issue is that that question seems to be a specific case of my problem (threshold_seq_len=1, subarray_seq_len=1), since the task there involves merely splitting an array based on the occurrence of threshold values. I have been trying to generalize it but have failed so far.
Here's one way -
import numpy as np
from scipy.ndimage.morphology import binary_closing

def filter_ar(a, threshold, threshold_seq_len, subarray_seq_len):
    # Mask wrt threshold, padded with False on both ends
    m0 = np.r_[False, a > threshold, False]
    # Close "holes", those one-off lesser-than-threshold elements
    k = np.ones(2, dtype=bool)
    m = binary_closing(m0, k)
    # Get initial start, stop indices
    idx = np.flatnonzero(m[:-1] != m[1:])
    s0, s1 = idx[::2], idx[1::2]
    # Masks based on subarray_seq_len, threshold_seq_len
    mask1 = (s1 - s0) >= subarray_seq_len
    mask2 = np.add.reduceat(m0, s0) >= threshold_seq_len
    # Get combined one after looking for first sequence that has
    # threshold_seq_len elements > threshold
    mask1[mask2.argmax():] &= True
    # Get valid start, stop indices and then split input array
    starts, ends = s0[mask1], s1[mask1]
    out = [a[i:j] for (i, j) in zip(starts, ends)]
    return out
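Called on the data from the question, this should reproduce the expected output:

a = np.array([0,2,2,3,4,2,5,5,6,2,5,6,4,4,2,3,1,7,7,2,3,3,4,1,8,9,8,8])
out = filter_ar(a, threshold=4, threshold_seq_len=5, subarray_seq_len=4)
# -> [array([5, 5, 6, 2, 5, 6]), array([8, 9, 8, 8])]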
This does work on your example, but I wasn't able to avoid a list comprehension. Also, I haven't checked whether this is slower than simply iterating over a list... (it might be)
# Indices of the above-threshold elements
b = np.where(a > threshold)[0]
# Split wherever the gap between consecutive indices is long enough
d = np.where(np.diff(b) >= threshold_seq_len)[0]
e = np.split(b, d + 1)
# Keep only the groups that span at least subarray_seq_len elements
subarrays = [a[i[0]:i[-1]+1] for i in e if (i[-1] - i[0] + 1) >= subarray_seq_len]

Function that returns the closest number to B that's in an UNSORTED multidimensional array, A?

As the title states, I want to create a function that will take a multidimensional array A and a number B, and ultimately return the number in A that is closest to B. If the number B is in A, then return it. If there are two numbers in A equally distant from B, choose the first one, counting from row to row.
This is the code I have so far:
import numpy as np
def g_C(A, B):
    A = np.asanyarray(A)
    assert A.ndim == 2  # to assert that A is a multidimensional array
    get = (np.abs(A - B)).argmin()
    return A[get]
However, from my understanding, (np.abs(A-B)).argmin() really only works effectively for sorted arrays? I'm not allowed to sort the array in this problem; I have to work on it at face value, examining it row by row and grabbing the first instance of the closest number to B.
So for example, g_C([[1,3,6,-8],[2,7,1,0],[4,5,2,8],[2,3,7,10]],9) should return 8
Also, I was given the hint that numpy.argmin would help, and I see that it returns the first occurrence of the minimum, which makes sense in this problem, but I'm not sure how exactly to fit it into the code I have at the moment.
EDIT
The flat suggestion works perfectly fine. Thank you everyone.
I'm trying RagingRoosevelt's second suggestion, and I'm stuck.
def g_C(A, B):
    A = np.asanyarray(A)
    D = np.full_like(A, B)  # an array D with the same shape as A, filled with the value B
    diffs = abs(D - A)      # absolute differences between D and A
    close = diffs.argmin(axis=1)  # argmin of 'diffs', row by row
    close = np.asanyarray(close)  # the row-wise argmins as an array
    closer = close.argmin()  # the final argmin ??
    return closer
I'm trying out this suggestion because I have another, related problem where I have to extract the row whose sum is closest to B. And I figure this is good practice anyway.
Your existing code is fine except that, by default, argmin returns an index into the flattened array. So you could do
return A.flat[abs(A - B).argmin()]
to get the right value from A.
EDIT: For your other problem - finding the row in a 2-dimensional array A whose sum is closest to B - you can do:
return A[abs(A.sum(axis=1) - B).argmin()]
In either case I don't see any need to create an array of B.
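A quick sanity check on the example from the question (my own test, not part of the original answer):

import numpy as np
A = np.array([[1,3,6,-8],[2,7,1,0],[4,5,2,8],[2,3,7,10]])
print(A.flat[abs(A - 9).argmin()])  # -> 8 (|8-9| == |10-9| == 1, but 8 comes first)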
Your problem is the same as a find-min problem. The only difference is that you're looking for min(abs(A[i]-B)) instead. So, iterate over your array. As you do so, record the smallest absolute delta and the index at which it occurred. When you find a smaller delta, update the record and then keep searching. When you've made it all the way through, return whatever value was at the recorded index.
Since you're working with numpy arrays, another approach is that you could create an array of identical size as A but filled only with value B. Compute the difference between the arrays and then use argmin on each row. Assemble an array of all minimum values for each row and then do argmin again to pull out the smallest of the values.
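Here is a minimal sketch of how that second suggestion might be completed (my reading of it, not RagingRoosevelt's own code); broadcasting makes the explicit array of B unnecessary:

import numpy as np

def g_C(A, B):
    A = np.asanyarray(A)
    diffs = np.abs(A - B)             # |A[i][j] - B| for every element
    cols = diffs.argmin(axis=1)       # best column within each row
    row = diffs.min(axis=1).argmin()  # row holding the smallest delta overall
    return A[row, cols[row]]          # argmin's first-occurrence rule keeps the earliest tie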
This will work for any 2-dimensional array with a nested for-loop, but I am not sure it is what you want (as in, it doesn't use numpy).
def g_C(A, B):
    n = A[0][0]           # best candidate so far; start with the first element
    m = abs(B - A[0][0])  # smallest distance seen so far
    for r in A:
        for i in r:
            if abs(B - i) < m:
                m = abs(B - i)
                n = i
    return n
Nevertheless, it does work:
>>> g_C([[1,3,6,-8],[2,7,1,0],[4,5,2,8],[2,3,7,10]],9)
8

Invertible Cartesian Product Elements/Index Translation Function

I have a problem where I need to identify the elements found at an indexed position within the Cartesian product of a series of lists, but also the inverse, i.e. identify the indexed position from a unique combination of elements from a series of lists.
I've written the following code which performs the task reasonably well:
import numpy as np

def index_from_combination(meta_list_shape, index_combination):
    list_product = np.prod(meta_list_shape)
    m_factor = np.cumprod([[l] for e, l in enumerate([1] + meta_list_shape)])[0:len(meta_list_shape)]
    return np.sum(index_combination * m_factor, axis=None)

def combination_at_index(meta_list_shape, index):
    il = len(meta_list_shape) - 1
    list_product = np.prod(meta_list_shape)
    assert index < list_product
    m_factor = np.cumprod([[l] for e, l in enumerate([1] + meta_list_shape)])[0:len(meta_list_shape)][::-1]
    idxl = []
    for e, m in enumerate(m_factor):
        if m <= index:
            idxl.append(index // m)
            index = index % m
        else:
            idxl.append(0)
    return idxl[::-1]
e.g.
index_from_combination([3,2],[2,1])
>> 5
combination_at_index([3,2],5)
>> [2,1]
Here [3,2] describes a series of two lists, one containing 3 elements and the other containing 2 elements. The combination [2,1] denotes the 3rd element (zero-indexed) from the 1st list and the 2nd element (again zero-indexed) from the second list.
...if a little clunkily (and, to save space, one that ignores the actual contents of the lists, and instead works with indexes used elsewhere to fetch the contents from those lists - that's not important here though).
N.B. What is important is that my functions mirror one another such that:
F(a)==b and G(b)==a
i.e. they are the inverse of one another.
From the linked question, it turns out I can replace the second function with the one-liner:
list(itertools.product(['A','B','C'],['P','Q','R'],['X','Y']))[index]
Which will return the unique combination of values for a supplied index integer (though with some question-mark in my mind about how much of that list is instantiated in memory - but again, that's not necessarily important right now).
What I'm asking is, itertools appears to have been built with these types of problems in mind - is there an equally neat one-line inverse to the itertools.product function that, given a combination, e.g. ['A','Q','Y'] will return an integer describing that combination's position within the cartesian product, such that this integer, if fed into the itertools.product function will return the original combination?
Imagine those combinations as two-dimensional X-Y coordinates and use subscript to linear-index conversion and vice versa. Thus, use NumPy's built-ins np.ravel_multi_index for getting the linear index and np.unravel_index for the subscript indices, which become your index_from_combination and combination_at_index respectively.
It's a simple translation and doesn't generate any combination whatsoever, so should be a breeze.
Sample run to make things clearer -
In [861]: np.ravel_multi_index((2,1),(3,2))
Out[861]: 5
In [862]: np.unravel_index(5, (3,2))
Out[862]: (2, 1)
The math is simple enough to implement yourself if you don't want the NumPy dependency for some reason -
def index_from_combination(a, b):
    return b[0]*a[1] + b[1]

def combination_at_index(a, b):
    d = b // a[1]
    r = b - a[1]*d
    return d, r
Sample run -
In [881]: index_from_combination([3,2],[2,1])
Out[881]: 5
In [882]: combination_at_index([3,2],5)
Out[882]: (2, 1)
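In case it helps, the same arithmetic generalizes to any number of lists (my own extension of the two-list math above, using Horner-style accumulation in row-major order):

def index_from_combination(shape, comb):
    idx = 0
    for n, c in zip(shape, comb):
        idx = idx * n + c  # scale up the earlier axes, then add this one
    return idx

def combination_at_index(shape, index):
    comb = []
    for n in reversed(shape):
        index, r = divmod(index, n)  # peel off the last axis each time
        comb.append(r)
    return comb[::-1]

index_from_combination([3,2], [2,1])  # -> 5
combination_at_index([3,2], 5)        # -> [2, 1]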

Making an index array from an array in numpy

Good morning experts,
I have an array which contains integer numbers, and I have a list with the unique values of the array, sorted in a special order. What I want is to make another array which will contain, for each element of a, the index of its value in that list.
#a numpy array with integer values
#size_x and size_y: array dimensions of a
#index_list contain the unique values of a sorted in a special order.
#b New array with the index values
for i in xrange(0, size_x):
    for j in xrange(0, size_y):
        b[i][j] = index_list.index(a[i][j])
This works, but it takes a long time. Is there a faster way to do it?
Many thanks for your help
German
The slow part is the lookup
index_list.index(a[i][j])
It will be much quicker to use a Python dictionary for this task, i.e. rather than
index_list = [ item_0, item_1, item_2, ...]
use
index_dict = { item_0:0, item_1:1, item_2:2, ...}
Which can be created using:
index_dict = dict( (item, i) for i, item in enumerate(index_list) )
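The original loop then becomes a dictionary lookup per element (a sketch using the question's own variable names):

index_dict = {item: i for i, item in enumerate(index_list)}
for i in xrange(0, size_x):
    for j in xrange(0, size_y):
        b[i][j] = index_dict[a[i][j]]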
Didn't try it, but as this is pure numpy, it should be much faster than a dictionary-based approach:
# note that the code will use the next higher value if a value is
# missing from index_list.
new_vals, old_index = np.unique(index_list, return_index=True)
# use searchsorted to find the index:
b_new_index = np.searchsorted(new_vals, a)
# And the original index:
b = old_index[b_new_index]
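A quick demonstration with made-up data (the values here are purely illustrative):

import numpy as np
a = np.array([[3, 1, 4],
              [1, 5, 9]])
index_list = [9, 5, 4, 3, 1]  # unique values of a, in some special order

new_vals, old_index = np.unique(index_list, return_index=True)
b = old_index[np.searchsorted(new_vals, a)]
# b -> [[3, 4, 2],
#       [4, 1, 0]], i.e. index_list.index() of every element of a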
Alternatively, you could simply fill any holes in index_list.
Edit: fixed the code; as it was, it was quite simply wrong (or very limited)...

Creating an array without certain ranges

In Python I have a numpy.ndarray called a and a list of indices called b. I want to get a list of all the values of a which are not within -10..10 places of any of the indices in b.
This is my current code, which takes a lot of time to run due to repeated allocations (a is very big):
aa = a
# Remove all ranges backwards
for bb in b[::-1]:
    aa = np.delete(aa, range(bb - 10, bb + 10))
Is there a way to do it more efficiently? Preferably with few memory allocations.
np.delete will take an array of indices of any size. You can simply populate the entire array of indices up front and perform the delete once, therefore only deallocating and reallocating once. (Not tested; possible typos.)
bb = np.empty((b.size, 21), dtype=int)
for i, v in enumerate(b):
    bb[i] = v + np.arange(-10, 11)
aa = np.delete(a, bb.flat)  # np.delete returns a new array; it looks like .flat is optional
Note: if your ranges overlap, you'll get a difference between this and your algorithm, because each successive np.delete in yours shifts the remaining indices, so yours can remove items that were not originally within 10 indices of an index in b.
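The index-building loop can also be written as a single broadcast operation (my variant of the same idea, shown with hypothetical data):

import numpy as np
a = np.arange(100)      # hypothetical large array
b = np.array([20, 50])  # hypothetical marker indices
# b[:, None] + the offsets builds the full (len(b), 21) index block in one shot
idx = b[:, None] + np.arange(-10, 11)
res = np.delete(a, idx.ravel())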
Could you find a certain number that you're sure will not be in a, and then set all indices around the b indices to that number, so that you can remove it afterwards?
import numpy as np
for i in range(-10, 11):
    a[b + i] = number_not_in_a
values = set(np.unique(a)) - set([number_not_in_a])
This code will not allocate new memory for a at all, needs only one range object, and does the job in exactly 22 C-optimized numpy operations (well, 43 if you count the b + i operations), plus the cost of turning the array returned by np.unique into a set.
Beware: if b includes indices smaller than 10, the number_not_in_a "zone" around those indices will wrap around to the other end of the array; if b includes indices larger than len(a) - 11, the operation will fail with an IndexError at some point.
