Compare multiple columns in numpy array - python

I have a 2D numpy array with about 12 columns and 1000+ rows, and each cell contains a number from 1 to 5. I'm searching for the best sextuple of columns according to my point system, where a 1 or 2 scores -1 point, a 4 or 5 scores +1, and a 3 scores 0.
If a row in a certain sextuple contains, for example, [1, 4, 5, 3, 4, 3], the score for this row should be +2, because 3*(+1) + 1*(-1) = 2. The next row may be [1, 2, 2, 3, 3, 3] and should score -3 points.
At first I tried a straightforward loop solution, but I realized there are 665,280 ordered ways to pick 6 of the 12 columns (924 distinct combinations) to compare, and since I also need to search for the best quintuple, quadruple, etc., the loop is taking forever.
Is there perhaps a smarter numpy-way of solving my problem?

import numpy as np
import itertools

N_rows = 10
# np.random.random_integers is deprecated; np.random.randint(1, 6, ...) is the modern equivalent
arr = np.random.random_integers(5, size=(N_rows, 12))

x = np.array([0, -1, -1, 0, 1, 1])
y = x[arr]
print(y)

score, best_sextuple = max((y[:, cols].sum(), cols)
                           for cols in itertools.combinations(range(12), 6))

print('''\
score: {s}
sextuple: {c}
'''.format(s=score, c=best_sextuple))
yields, for example,
score: 6
sextuple: (0, 1, 5, 8, 10, 11)
Explanation:
First, let's generate a random example, with 12 columns and 10 rows:
N_rows = 10
arr = np.random.random_integers(5, size=(N_rows,12))
Now we can use numpy fancy indexing to convert the numbers 1, 2, ..., 5 in arr to the score values -1, 0, 1 (according to your scoring system):
x = np.array([0,-1,-1,0,1,1])
y = x[arr]
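For instance, applying this mapping to the example row from the question (a quick check of mine, not part of the original answer):

row = np.array([1, 4, 5, 3, 4, 3])
x[row]        # array([-1,  1,  1,  0,  1,  0])
x[row].sum()  # 2, the +2 score described in the question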
Next, let's use itertools.combinations to generate all possible combinations of 6 columns:
for cols in itertools.combinations(range(12),6)
and
y[:,cols].sum()
then gives the score for cols, a choice of columns (sextuple).
Finally, use max to pick off the sextuple with the best score:
score, best_sextuple = max((y[:,cols].sum(), cols)
                           for cols in itertools.combinations(range(12),6))
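The question also mentions searching for the best quintuple, quadruple, and so on; the same pattern generalizes by varying the combination size (a sketch reusing y from above, not part of the original answer):

best = {k: max((y[:, cols].sum(), cols)
               for cols in itertools.combinations(range(12), k))
        for k in (4, 5, 6)}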

import numpy

A = numpy.random.randint(1, 6, size=(1000, 12))
points = -1*(A == 1) + -1*(A == 2) + 1*(A == 4) + 1*(A == 5)
columnsums = numpy.sum(points, axis=0)

def best6(row):
    return numpy.argsort(row)[-6:]

bestcolumns = best6(columnsums)
allbestcolumns = list(map(best6, points))  # wrapped in list() so it is materialized in Python 3
bestcolumns will now contain the best 6 columns in ascending order. By similar logic, allbestcolumns will contain the best six columns in each row.
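Worth noting (my observation, not part of the original answer): the total score of any sextuple is just the sum of its per-column sums, so the top six column sums already give an optimal sextuple. A quick sanity check against the brute-force search from the accepted answer (variable names are mine):

import itertools
import numpy as np

A = np.random.randint(1, 6, size=(1000, 12))
points = -1*(A == 1) + -1*(A == 2) + 1*(A == 4) + 1*(A == 5)
columnsums = points.sum(axis=0)

greedy = tuple(sorted(np.argsort(columnsums)[-6:]))
_, brute = max((points[:, c].sum(), c) for c in itertools.combinations(range(12), 6))
# ties aside, both choices achieve the same (maximal) total score
assert points[:, list(greedy)].sum() == points[:, list(brute)].sum()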

Extending unutbu's longer answer above, it's possible to generate the array of scores automatically. Since the scores for the values are the same on every pass through the loop, they only need to be calculated once per value. Here's a slightly inelegant way to do it, shown on an example 6x10 array before and after the scores are applied.
>>> import numpy
>>> values = numpy.random.randint(6, size=(6,10))
>>> values
array([[4, 5, 1, 2, 1, 4, 0, 1, 0, 4],
       [2, 5, 2, 2, 3, 1, 3, 5, 3, 1],
       [3, 3, 5, 4, 2, 1, 4, 0, 0, 1],
       [2, 4, 0, 0, 4, 1, 4, 0, 1, 0],
       [0, 4, 1, 2, 0, 3, 3, 5, 0, 1],
       [2, 3, 3, 4, 0, 1, 1, 1, 3, 2]])
>>> b = values.copy()
>>> b[ b<3 ] = -1
>>> b[ b==3 ] = 0
>>> b[ b>3 ] = 1
>>> b
array([[ 1,  1, -1, -1, -1,  1, -1, -1, -1,  1],
       [-1,  1, -1, -1,  0, -1,  0,  1,  0, -1],
       [ 0,  0,  1,  1, -1, -1,  1, -1, -1, -1],
       [-1,  1, -1, -1,  1, -1,  1, -1, -1, -1],
       [-1,  1, -1, -1, -1,  0,  0,  1, -1, -1],
       [-1,  0,  0,  1, -1, -1, -1, -1,  0, -1]])
Incidentally, this thread claims that creating the combinations directly within numpy will yield around 5x faster performance than itertools, though perhaps at the expense of some readability.
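Relatedly, whichever way the combinations are produced, the scoring step itself can also be vectorized so that all 924 sextuples are evaluated in one shot (a sketch of mine reusing the y array from the accepted answer, not code from the linked thread):

combos = np.array(list(itertools.combinations(range(12), 6)))  # shape (924, 6)
scores = y[:, combos].sum(axis=(0, 2))                         # one total score per sextuple
best_sextuple = tuple(combos[scores.argmax()])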

Related

Find first n non-zero values in numpy 2d array

I would like to know the fastest way to extract the indices of the first n non zero values per column in a 2D array.
For example, with the following array:
arr = np.array([
    [4, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 4, 0, 0],
    [2, 0, 9, 0],
    [6, 0, 0, 0],
    [0, 7, 0, 0],
    [3, 0, 0, 0],
    [1, 2, 0, 0],
])
With n=2 I would have [0, 0, 1, 1, 2] as xs (column indices) and [0, 3, 2, 5, 3] as ys (row indices): 2 values in the first and second columns and 1 in the third.
Here is how it is currently done:
x = []
y = []
n = 3
for i, c in enumerate(arr.T):
    a = c.nonzero()[0][:n]
    if len(a):
        x.extend([i] * len(a))
        y.extend(a)
In practice I have arrays of size (405, 256).
Is there a way to make it faster?
Here is a method that does not require sorting the array (only a linear scan is needed to collect the non-zero values), although it is somewhat convoluted because it chains several functions:
n = 2
# Get indices with non-null values, column indices first
nnull = np.stack(np.where(arr.T != 0))
# Split the indices by unique value of column
cols_ids = np.array_split(range(len(nnull[0])), np.where(np.diff(nnull[0]) > 0)[0] + 1)
# Take (at most) n in each and concatenate the whole
np.concatenate([nnull[:, u[:n]] for u in cols_ids], axis=1)
outputs:
array([[0, 0, 1, 1, 2],
       [0, 3, 2, 5, 3]], dtype=int64)
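For the example arr above, the intermediate values look like this (shown here purely as an illustration of what the splitting does):

nnull
# array([[0, 0, 0, 0, 0, 1, 1, 1, 2],
#        [0, 3, 4, 6, 7, 2, 5, 7, 3]])
cols_ids
# [array([0, 1, 2, 3, 4]), array([5, 6, 7]), array([8])]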
Here is one approach using argsort; it gives a different order, though:
n = 2
m = arr!=0
# non-zero values first
idx = np.argsort(~m, axis=0)
# get first 2 and ensure non-zero
m2 = np.take_along_axis(m, idx, axis=0)[:n]
y,x = np.where(m2)
# slice
x, idx[y,x]
# (array([0, 1, 2, 0, 1]), array([0, 2, 3, 3, 5]))
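One caveat (my note, not part of the original answer): np.argsort defaults to an unstable sort, so among a column's non-zero entries the n positions it returns are valid non-zeros but not guaranteed to be the first n in row order; requesting a stable sort preserves that order:

idx = np.argsort(~m, axis=0, kind='stable')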
Use an offset ("dislocation") comparison on the row results of the transposed nonzero: because the column indices i come out sorted, an entry is among the first n of its column exactly when its column index differs from the one n positions earlier.
>>> n = 2
>>> i, j = arr.T.nonzero()
>>> mask = np.concatenate([[True] * n, i[n:] != i[:-n]])
>>> i[mask], j[mask]
(array([0, 0, 1, 1, 2], dtype=int64), array([0, 3, 2, 5, 3], dtype=int64))
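Wrapped up as a small helper (the function name and the guard for very short inputs are mine, not from the answer):

def first_n_nonzero_per_column(arr, n):
    # column index (i) and row index (j) of every non-zero entry;
    # the transpose makes the pairs come out grouped by column
    i, j = arr.T.nonzero()
    # an entry is among the first n of its column exactly when the
    # entry n positions earlier belongs to a different column
    head = np.ones(min(n, i.size), dtype=bool)
    mask = np.concatenate([head, i[n:] != i[:-n]])
    return i[mask], j[mask]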

Replace consecutive duplicates in 2D numpy array

I have a two dimensional numpy array x:
import numpy as np
x = np.array([
    [1, 2, 8, 4, 5, 5, 5, 3],
    [0, 2, 2, 2, 2, 1, 1, 4]
])
My goal is to replace all consecutive duplicate numbers with a specific value (let's take -1), while leaving one occurrence unchanged.
I could do this as follows:
def replace_consecutive_duplicates(x):
    consec_dup = np.zeros(x.shape, dtype=bool)
    consec_dup[:, 1:] = np.diff(x, axis=1) == 0
    x[consec_dup] = -1
    return x

# current output
replace_consecutive_duplicates(x)
# array([[ 1,  2,  8,  4,  5, -1, -1,  3],
#        [ 0,  2, -1, -1, -1,  1, -1,  4]])
However, in this case the one occurrence left unchanged is always the first.
My goal is to leave the middle occurrence unchanged.
So given the same x as input, the desired output of function replace_consecutive_duplicates is:
# desired output
replace_consecutive_duplicates(x)
# array([[ 1,  2,  8,  4, -1,  5, -1,  3],
#        [ 0, -1,  2, -1, -1,  1, -1,  4]])
Note that for consecutive duplicate sequences with an even number of occurrences, the left of the two middle values should be left unchanged. So the consecutive duplicate sequence [2, 2, 2, 2] in x[1] becomes [-1, 2, -1, -1].
Also note that I'm looking for a vectorized solution for 2D numpy arrays since performance is of absolute importance in my particular use case.
I've already tried looking at things like run length encoding and using np.diff(), but I didn't manage to solve this. Hope you guys can help!
The main problem is that you need the length of each run of consecutive values. This is not easy to get with numpy alone, but using itertools.groupby we can solve it with the following code.
import itertools

import numpy as np

x = np.array([
    [1, 2, 8, 4, 5, 5, 5, 3],
    [0, 2, 2, 2, 2, 1, 1, 4]
])

def replace_row(arr: np.ndarray, new_val=-1):
    results = []
    for val, count in itertools.groupby(arr):
        k = len(list(count))
        results.extend([new_val] * ((k - 1) // 2))
        results.append(val)
        results.extend([new_val] * (k // 2))
    return np.fromiter(results, arr.dtype)

if __name__ == '__main__':
    for idx, row in enumerate(x):
        x[idx, :] = replace_row(row)
    print(x)
Output:
[[ 1  2  8  4 -1  5 -1  3]
 [ 0 -1  2 -1 -1  1 -1  4]]
This isn't vectorized, but it can be combined with multithreading, since every row is handled independently.
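For reference, here is one possible vectorized sketch of the same middle-keeping rule (my own take, not part of the answer above; the function name is made up and a 2D array is assumed):

import numpy as np

def replace_consecutive_duplicates_mid(x, new_val=-1):
    out = np.full_like(x, new_val)
    # mark the start of every run; a new row always starts a new run
    change = np.ones_like(x, dtype=bool)
    change[:, 1:] = x[:, 1:] != x[:, :-1]
    # flat indices of the run starts, and the length of each run
    starts = np.flatnonzero(change)
    lengths = np.diff(np.append(starts, x.size))
    # keep only the (left-)middle element of each run
    keep = starts + (lengths - 1) // 2
    out.flat[keep] = x.flat[keep]
    return out

# with the original x from the question:
replace_consecutive_duplicates_mid(x)
# array([[ 1,  2,  8,  4, -1,  5, -1,  3],
#        [ 0, -1,  2, -1, -1,  1, -1,  4]])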

Concatenate two numpy arrays so that the index order stays the same?

Assume I have two numpy arrays as follows:
{0: array([ 2, 4, 8, 9, 12], dtype=int64),
1: array([ 1, 3, 5], dtype=int64)}
Now I want to replace each array's values with its key (ID), i.e. the values in array 0 all become 0 and those in array 1 become 1; then both arrays should be merged so that the labels appear in the order of the index values they replace.
I.e. desired output:
array([1, 0, 1, 0, 1, 0, 0 ,0])
But that's what I get:
np.concatenate((h1,h2), axis=0)
array([0, 0, 0, 0, 0, 1, 1, 1])
(Each array contains only unique values, if this helps.)
How can this be done?
Your description of merging is a bit unclear. But here's something that makes sense
In [399]: dd = {0: np.array([ 2,  4,  8,  9, 12]),
     ...:       1: np.array([ 1,  3,  5])}
In [403]: res = np.zeros(13, int)
In [404]: res[dd[0]] = 0
In [405]: res[dd[1]] = 1
In [406]: res
Out[406]: array([0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])
Or to make the assignments clearer:
In [407]: res = np.zeros(13, int)
In [408]: res[dd[0]] = 2
In [409]: res[dd[1]] = 1
In [410]: res
Out[410]: array([0, 1, 2, 1, 2, 1, 0, 0, 2, 2, 0, 0, 2])
Otherwise the talk of index positions doesn't make a whole lot of sense.
Something like this?
d = {0: np.array([ 2,  4,  8,  9, 12], dtype=np.int64),
     1: np.array([ 1,  3,  5], dtype=np.int64)}

(np.concatenate([d[0], d[1]]).argsort(kind="stable") >= len(d[0])).view(np.uint8)
# array([1, 0, 1, 0, 1, 0, 0, 0], dtype=uint8)
np.concatenate just appends the lists/arrays.
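Spelled out a little more explicitly, the same idea reads (a sketch with my own variable names):

vals = np.concatenate([d[0], d[1]])
labels = np.concatenate([np.zeros(len(d[0]), dtype=int), np.ones(len(d[1]), dtype=int)])
labels[np.argsort(vals, kind="stable")]
# array([1, 0, 1, 0, 1, 0, 0, 0])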
Maybe an unconventional way to go about it, but you could tile the alternating [1, 0] pattern (numpy.tile) for the length of the shorter array, and then use numpy.repeat to pad with the longer array's label for the difference in lengths?
# h1 and h2 are the two arrays from the question (lengths 5 and 3)
if len(h1) > len(h2):
    temp = len(h2)
else:
    temp = len(h1)
diff = abs(len(h1) - len(h2))

# alternate the two labels for the overlapping part
# (this assumes the values of the two arrays strictly alternate, as in the example)
A = numpy.tile([1, 0], temp)
# pad with the label of the longer array (0 here)
B = numpy.repeat(0, diff)
C = numpy.concatenate((A, B), axis=0)
# C is now array([1, 0, 1, 0, 1, 0, 0, 0])
Maybe not the most dynamic or cleanest way to go about this, but if your use case needs just that, it could do the job in the meantime.

How to efficiently delete all indices with a certain value in one list out of both lists?

I have two numpy arrays from which I am trying to delete all indices which have the value -1 in the second array.
Example:
goldLabels = np.array([12, 2, 0, 0, 0, 1, 5])
predictions = np.array([12, 3, 0, 2, -1, -1, -1])
Expected result:
>>> print(goldLabels)
[12, 2, 0, 0]
>>> print(predictions)
[12, 3, 0, 2]
This is my code so far:
idcs = []
for idx, label in enumerate(goldLabels):
if label == -1:
idcs.append(idx)
goldLabels = np.delete(goldLabels, idcs)
predictions = np.delete(predictions, idcs)
Is there any way to do this more efficiently?
You can use numpy's capabilities to directly extract those numbers using a mask:
goldLabels = np.array([12, 2, 0, 0, 0, 1, 5])
predictions = np.array([12, 3, 0, 2, -1, -1, -1])
mask = predictions!=-1
predictions = predictions[mask]
goldLabels = goldLabels[mask]
print(goldLabels)
print(predictions)
Output:
[12 2 0 0]
[12 3 0 2]
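If the sentinel value could appear in either array (an assumption that goes beyond the original question), the same boolean-mask approach still works; the conditions just get combined:

mask = (predictions != -1) & (goldLabels != -1)
predictions = predictions[mask]
goldLabels = goldLabels[mask]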

numpy array set ones between two values, fast

I have been looking for a solution to this problem for a while but can't seem to find anything.
For example, I have a numpy array of
[ 0, 0, 2, 3, 2, 4, 3, 4, 0, 0, -2, -1, -4, -2, -1, -3, -4, 0, 2, 3, -2, -1, 0]
What I would like to achieve is to generate another array that indicates the elements between a pair of numbers, let's say between 2 and -2 here. So I want to get an array like this:
[ 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
Notice that any 2 or -2 inside an open (2, -2) pair is ignored. An easy approach is to iterate through each element with a for loop, identify the first occurrence of 2, set everything after that to 1 until you hit a -2, and then start looking for the next 2 again.
But I would like this process to be faster, as I have over 1000 elements in the numpy array and this process needs to be done a lot of times. Do you guys know an elegant way to solve this? Thanks in advance!
Quite a problem that is! Listed in this post is a vectorized solution (hopefully the inlined comments help explain the logic behind it). I am assuming A is the input array, with T1 and T2 as the start and stop triggers.
def setones_between_triggers(A,T1,T2):
    # Get start and stop indices corresponding to rising and falling triggers
    start = np.where(A==T1)[0]
    stop = np.where(A==T2)[0]

    # Take care of boundary conditions for np.searchsorted to work
    if (stop[-1] < start[-1]) & (start[-1] != A.size-1):
        stop = np.append(stop,A.size-1)

    # This is where the magic happens.
    # Validate (filter out) the triggers based on the set conditions:
    # 1. See if there is more than one stop index between two start indices.
    #    If so, use the first one and reject all others in that in-between space.
    # 2. Repeat the same check for start, but use the validated start indices.

    # First off, take care of out-of-bound cases for proper indexing
    stop_valid_idx = np.unique(np.searchsorted(stop,start,'right'))
    stop_valid_idx = stop_valid_idx[stop_valid_idx < stop.size]
    stop_valid = stop[stop_valid_idx]

    _,idx = np.unique(np.searchsorted(stop_valid,start,'left'),return_index=True)
    start_valid = start[idx]

    # Create shifts array (filled with zeros, except +1 at validated T1 positions
    # and -1 at validated T2 positions)
    shifts = np.zeros(A.size,dtype=int)
    shifts[start_valid] = 1
    shifts[stop_valid] = -1

    # Perform cumulative summation that almost gives us the desired output
    out = shifts.cumsum()

    # For the worst case where two (T1,T2) groups are adjacent to each other,
    # set the negative trigger positions to 1 as well
    out[stop_valid] = 1
    return out
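To see the core shifts/cumsum idea in isolation (a stripped-down illustration of mine, ignoring the trigger-validation step above):

shifts = np.zeros(10, dtype=int)
shifts[2] = 1            # one validated rising trigger (T1) at index 2
shifts[6] = -1           # one validated falling trigger (T2) at index 6
out = shifts.cumsum()    # array([0, 0, 1, 1, 1, 1, 0, 0, 0, 0])
out[6] = 1               # include the falling-trigger position itself
out                      # array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0])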
Sample runs
Original sample case:
In [1589]: A
Out[1589]:
array([ 0,  0,  2,  3,  2,  4,  3,  4,  0,  0, -2, -1, -4, -2, -1, -3, -4,
        0,  2,  3, -2, -1,  0])

In [1590]: setones_between_triggers(A,2,-2)
Out[1590]: array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0])
Worst case #1 (adjacent (2,-2) groups):
In [1595]: A
Out[1595]:
array([-2,  2,  0,  2, -2,  2,  2,  2,  4, -2,  0, -2, -2, -4, -2, -1,  2,
       -4,  0,  2,  3, -2, -2,  0])

In [1596]: setones_between_triggers(A,2,-2)
Out[1596]:
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
       0], dtype=int32)
Worst case #2 (2 without any -2 till the end):
In [1603]: A
Out[1603]:
array([-2,  2,  0,  2, -2,  2,  2,  2,  4, -2,  0, -2, -2, -4, -2, -1, -2,
       -4,  0,  2,  3,  5,  6,  0])

In [1604]: setones_between_triggers(A,2,-2)
Out[1604]:
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1], dtype=int32)
Assuming you have a huge dataset, I prefer to do a pair of initial searches for the two boundaries and then use a for-loop over those indices for validation.
def between_pairs(x, b1, b2):
    # output vector
    out = np.zeros_like(x)
    # reversed lists of indices for possible rising and trailing edges
    rise_edges = list(np.argwhere(x==b1)[::-1,0])
    trail_edges = list(np.argwhere(x==b2)[::-1,0])

    # determine the rising/trailing edge pairs
    rt_pairs = []
    t = None
    # look for the next rising edge after the previous trailing edge
    while rise_edges:
        r = rise_edges.pop()
        if t is not None and r < t:
            continue
        # look for the next trailing edge after the previous rising edge
        while trail_edges:
            t = trail_edges.pop()
            if t > r:
                rt_pairs.append((r, t))
                break

    # use the rising/trailing pairs to fill in out
    for rt in rt_pairs:
        out[rt[0]:rt[1]+1] = 1
    return out

# Example
a = np.array([0, 0, 2, 3, 2, 4, 3, 4, 0, 0, -2, -1, -4, -2, -1, -3, -4,
              0, 2, 3, -2, -1, 0])
d = between_pairs(a, 2, -2)
print(repr(d))
array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0])
I did a speed comparison with the alternative answer given by @CactusWoman (below):
def between_vals(x, val1, val2):
    out = np.zeros(x.shape, dtype=int)
    in_range = False
    for i, v in enumerate(x):
        if v == val1 and not in_range:
            in_range = True
        if in_range:
            out[i] = 1
        if v == val2 and in_range:
            in_range = False
    return out
I found the following
In [59]: a = np.random.choice(np.arange(-5, 6), 2000)
In [60]: %timeit between_vals(a, 2, -2)
1000 loops, best of 3: 681 µs per loop
In [61]: %timeit between_pairs(a, 2, -2)
1000 loops, best of 3: 182 µs per loop
and for a much smaller dataset,
In [72]: a = np.random.choice(np.arange(-5, 6), 50)
In [73]: %timeit between_vals(a, 2, -2)
10000 loops, best of 3: 17 µs per loop
In [74]: %timeit between_pairs(a, 2, -2)
10000 loops, best of 3: 34.7 µs per loop
Therefore it all depends on your dataset size.
Is iterating through the array really too slow?
def between_vals(x, val1, val2):
    out = np.zeros(x.shape, dtype=int)
    in_range = False
    for i, v in enumerate(x):
        if v == val1 and not in_range:
            in_range = True
        if in_range:
            out[i] = 1
        if v == val2 and in_range:
            in_range = False
    return out
I'm in the same boat as @Randy C: nothing else I've tried is faster than this.
I've tried a few things at this point, and the need to keep track of state for the start/finish markers has made the more clever things I've tried slower than the dumb iterative approach I used as a check:
for _ in range(1000):
    a = np.random.choice(np.arange(-5, 6), 2000)

    found2 = False
    l = []
    for el in a:
        if el == 2:
            found2 = True

        l.append(1 if found2 else 0)

        if el == -2:
            found2 = False

    l = np.array(l)
