numpy array set ones between two values, fast - python

I have been looking for a solution to this problem for a while but can't seem to find anything.
For example, I have a NumPy array of
[ 0, 0, 2, 3, 2, 4, 3, 4, 0, 0, -2, -1, -4, -2, -1, -3, -4, 0, 2, 3, -2, -1, 0]
What I would like to do is generate another array that indicates the elements between a pair of numbers, say between 2 and -2 here. So I want to get an array like this:
[ 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
Notice that any 2 or -2 between a pair of (2, -2) is ignored. An easy approach is to iterate through each element with a for loop: find the first occurrence of 2, set everything after it to 1 until you hit a -2, then look for the next 2 again.
But I would like this process to be faster, as I have over 1000 elements in the array and the process needs to be done many times. Does anyone know an elegant way to solve this? Thanks in advance!

Quite a problem that is! Listed in this post is a vectorized solution (hopefully the inlined comments help explain the logic behind it). I am assuming A is the input array, with T1 and T2 as the start and stop triggers.
import numpy as np

def setones_between_triggers(A, T1, T2):
    # Get start and stop indices corresponding to rising and falling triggers
    start = np.where(A == T1)[0]
    stop = np.where(A == T2)[0]

    # Take care of boundary conditions for np.searchsorted to work
    if (stop[-1] < start[-1]) & (start[-1] != A.size - 1):
        stop = np.append(stop, A.size - 1)

    # This is where the magic happens.
    # Validate (filter out) the triggers based on the set conditions:
    # 1. See if there is more than one stop index between two start indices.
    #    If so, use the first one and reject all others in that in-between space.
    # 2. Repeat the same check for start, but use the validated start indices.
    # First off, take care of out-of-bound cases for proper indexing
    stop_valid_idx = np.unique(np.searchsorted(stop, start, 'right'))
    stop_valid_idx = stop_valid_idx[stop_valid_idx < stop.size]
    stop_valid = stop[stop_valid_idx]

    _, idx = np.unique(np.searchsorted(stop_valid, start, 'left'), return_index=True)
    start_valid = start[idx]

    # Create shifts array (filled with zeros, except +1 at validated T1
    # positions and -1 at validated T2 positions)
    shifts = np.zeros(A.size, dtype=int)
    shifts[start_valid] = 1
    shifts[stop_valid] = -1

    # Cumulative summation almost gives us the desired output
    out = shifts.cumsum()

    # For the worst case of two (T1,T2) groups adjacent to each other,
    # set the negative trigger positions to 1 as well
    out[stop_valid] = 1
    return out
Sample runs
Original sample case :
In [1589]: A
Out[1589]:
array([ 0, 0, 2, 3, 2, 4, 3, 4, 0, 0, -2, -1, -4, -2, -1, -3, -4,
0, 2, 3, -2, -1, 0])
In [1590]: setones_between_triggers(A,2,-2)
Out[1590]: array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0])
Worst case #1 (adjacent (2,-2) groups) :
In [1595]: A
Out[1595]:
array([-2, 2, 0, 2, -2, 2, 2, 2, 4, -2, 0, -2, -2, -4, -2, -1, 2,
-4, 0, 2, 3, -2, -2, 0])
In [1596]: setones_between_triggers(A,2,-2)
Out[1596]:
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
0], dtype=int32)
Worst case #2 (2 without any -2 till end) :
In [1603]: A
Out[1603]:
array([-2, 2, 0, 2, -2, 2, 2, 2, 4, -2, 0, -2, -2, -4, -2, -1, -2,
-4, 0, 2, 3, 5, 6, 0])
In [1604]: setones_between_triggers(A,2,-2)
Out[1604]:
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1], dtype=int32)
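As a side note on the validation step: np.searchsorted reports, for each start index, where it would slot into the sorted stop array, which is what lets duplicate triggers in the same gap be filtered with np.unique. A tiny illustrative demo, using made-up index arrays (not taken from the answer above):

```python
import numpy as np

# hypothetical trigger positions: stops at 3, 7, 10; starts at 1, 4, 5, 8
stop = np.array([3, 7, 10])
start = np.array([1, 4, 5, 8])

# for each start, the number of stops at or before it
slots = np.searchsorted(stop, start, 'right')
print(slots)        # [0 1 1 2]

# starts 4 and 5 fall in the same gap, so np.unique keeps only the first
_, idx = np.unique(slots, return_index=True)
print(start[idx])   # [1 4 8]
```

Starts that share a slot value lie between the same pair of stops, so only the first of them can open a new run.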

Assuming you have a huge dataset, I prefer to do a pair of initial searches for the two boundaries and then use a for-loop over those indices for validation.
import numpy as np

def between_pairs(x, b1, b2):
    # output vector
    out = np.zeros_like(x)
    # reversed lists of indices for possible rising and trailing edges
    rise_edges = list(np.argwhere(x == b1)[::-1, 0])
    trail_edges = list(np.argwhere(x == b2)[::-1, 0])
    # determine the rising/trailing edge pairs
    rt_pairs = []
    t = None
    # look for the next rising edge after the previous trailing edge
    while rise_edges:
        r = rise_edges.pop()
        if t is not None and r < t:
            continue
        # look for the next trailing edge after the previous rising edge
        while trail_edges:
            t = trail_edges.pop()
            if t > r:
                rt_pairs.append((r, t))
                break
    # use the rising/trailing pairs to fill the output
    for rt in rt_pairs:
        out[rt[0]:rt[1]+1] = 1
    return out

# Example
a = np.array([0, 0, 2, 3, 2, 4, 3, 4, 0, 0, -2, -1, -4, -2, -1, -3, -4,
              0, 2, 3, -2, -1, 0])
d = between_pairs(a, 2, -2)
print(repr(d))
array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0])
I did a speed comparison with the alternative answer given by @CactusWoman:
def between_vals(x, val1, val2):
    out = np.zeros(x.shape, dtype=int)
    in_range = False
    for i, v in enumerate(x):
        if v == val1 and not in_range:
            in_range = True
        if in_range:
            out[i] = 1
        if v == val2 and in_range:
            in_range = False
    return out
I found the following
In [59]: a = np.random.choice(np.arange(-5, 6), 2000)
In [60]: %timeit between_vals(a, 2, -2)
1000 loops, best of 3: 681 µs per loop
In [61]: %timeit between_pairs(a, 2, -2)
1000 loops, best of 3: 182 µs per loop
and for a much smaller dataset,
In [72]: a = np.random.choice(np.arange(-5, 6), 50)
In [73]: %timeit between_vals(a, 2, -2)
10000 loops, best of 3: 17 µs per loop
In [74]: %timeit between_pairs(a, 2, -2)
10000 loops, best of 3: 34.7 µs per loop
Therefore it all depends on your dataset size.

Is iterating through the array really too slow?
def between_vals(x, val1, val2):
    out = np.zeros(x.shape, dtype=int)
    in_range = False
    for i, v in enumerate(x):
        if v == val1 and not in_range:
            in_range = True
        if in_range:
            out[i] = 1
        if v == val2 and in_range:
            in_range = False
    return out
I'm in the same boat as @Randy C: nothing else I've tried is faster than this.

I've tried a few things at this point, and the need to keep track of state for the start/finish markers has made the more clever things I've tried slower than the dumb iterative approach I used as a check:
for _ in range(1000):
    a = np.random.choice(np.arange(-5, 6), 2000)

    found2 = False
    l = []
    for el in a:
        if el == 2:
            found2 = True
        l.append(1 if found2 else 0)
        if el == -2:
            found2 = False
    l = np.array(l)

Related

Concatenate two numpy arrays so that index order keeps the same?

Assume I have a dict of two numpy arrays as follows:
{0: array([ 2, 4, 8, 9, 12], dtype=int64),
1: array([ 1, 3, 5], dtype=int64)}
Now I want to replace the values of each array with its dict key, i.e. the values in array 0 become 0 and those in array 1 become 1; then both arrays should be merged, keeping the index order correct.
I.e. desired output:
array([1, 0, 1, 0, 1, 0, 0 ,0])
But that's what I get:
np.concatenate((h1,h2), axis=0)
array([0, 0, 0, 0, 0, 1, 1, 1])
(Each array contains only unique values, if this helps.)
How can this be done?
Your description of merging is a bit unclear. But here's something that makes sense
In [399]: dd ={0: np.array([ 2, 4, 8, 9, 12]),
...: 1: np.array([ 1, 3, 5])}
In [403]: res = np.zeros(13, int)
In [404]: res[dd[0]] = 0
In [405]: res[dd[1]] = 1
In [406]: res
Out[406]: array([0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])
Or to make the assignments clearer:
In [407]: res = np.zeros(13, int)
In [408]: res[dd[0]] = 2
In [409]: res[dd[1]] = 1
In [410]: res
Out[410]: array([0, 1, 2, 1, 2, 1, 0, 0, 2, 2, 0, 0, 2])
Otherwise the talk of index positions doesn't make a whole lot of sense.
Something like this?
d = {0: np.array([ 2, 4, 8, 9, 12], dtype=np.int64),
     1: np.array([ 1, 3, 5], dtype=np.int64)}
(np.concatenate([d[0],d[1]]).argsort(kind="stable")>=len(d[0])).view(np.uint8)
# array([1, 0, 1, 0, 1, 0, 0, 0], dtype=uint8)
np.concatenate just appends lists/arrays.
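For clarity, the one-liner above can be unpacked into steps (a sketch of the same stable-argsort trick):

```python
import numpy as np

d = {0: np.array([2, 4, 8, 9, 12]), 1: np.array([1, 3, 5])}

merged = np.concatenate([d[0], d[1]])
# positions of the merged values in sorted order; positions >= len(d[0])
# came from d[1], so the comparison yields the dict key for each rank
order = merged.argsort(kind="stable")
labels = (order >= len(d[0])).astype(np.uint8)
print(labels)   # [1 0 1 0 1 0 0 0]
```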
Maybe an unconventional way to go about it, but you could repeat the [1, 0] pattern for the length of the shorter array using numpy.tile, and then pad with the longer array's label for the difference between the two lengths:
if len(h1) > len(h2):
    temp = len(h2)
else:
    temp = len(h1)
diff = abs(len(h1) - len(h2))

A = numpy.tile([1, 0], temp)   # alternating pattern for the shorter length
B = numpy.repeat(0, diff)      # pad with the longer array's label
C = numpy.concatenate((A, B), axis=0)
Maybe not the most dynamic or kindest way to go about this (it assumes the two arrays' values strictly alternate), but if your solution requires just that, it could do the job in the meantime.

Count since last occurrence in NumPy

Seemingly straightforward problem: I want to create an array that gives the count since the last occurrence of a given condition. Here, let the condition be a > 0:
in: [0, 0, 5, 0, 0, 2, 1, 0, 0]
out: [0, 0, 0, 1, 2, 0, 0, 1, 2]
I assume step one would be something like np.cumsum(a > 0), but not sure where to go from there.
Edit: Should clarify that I want to do this without iteration.
Numpy one-liner:
x = numpy.array([0, 0, 5, 0, 0, 2, 1, 0, 0])
result = numpy.arange(len(x)) - numpy.maximum.accumulate(numpy.arange(len(x)) * (x > 0))
Gives
[0, 1, 0, 1, 2, 0, 0, 1, 2]
If you want zeros in the beginning, set them to zero explicitly:
result[:numpy.nonzero(x)[0][0]] = 0
Split the array at the positions where the condition holds and use the lengths of the resulting pieces, together with the condition state of the first and last elements of the array.
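A sketch of that split-based idea, looping only over the gaps between hits rather than over elements (`count_since` is a name made up here):

```python
import numpy as np

def count_since(x):
    x = np.asarray(x)
    hits = np.flatnonzero(x > 0)       # positions where the condition holds
    out = np.zeros(len(x), dtype=int)
    bounds = np.append(hits, len(x))   # each gap ends at the next hit (or the end)
    for h, nxt in zip(hits, bounds[1:]):
        out[h + 1:nxt] = np.arange(1, nxt - h)
    return out

print(count_since([0, 0, 5, 0, 0, 2, 1, 0, 0]))
# [0 0 0 1 2 0 0 1 2]
```

Unlike the one-liner above, this leaves the elements before the first hit at zero without a separate fix-up step.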
A pure python solution:
result = []
delta = 0
for val in [0, 0, 5, 0, 0, 2, 1, 0, 0]:
    delta += 1
    if val > 0:
        delta = 0
    result.append(delta)

Finding the consecutive zeros in a numpy array

I have the following array
a = [1, 2, 3, 0, 0, 0, 0, 0, 0, 4, 5, 6, 0, 0, 0, 0, 9, 8, 7, 0, 10, 11]
I would like to find the start and the end index of the array where the values are zeros consecutively. For the array above the output would be as follows
[3,8],[12,15],[19]
I want to achieve this as efficiently as possible.
Here's a fairly compact vectorized implementation. I've changed the requirements a bit, so the return value is a bit more "numpythonic": it creates an array with shape (m, 2), where m is the number of "runs" of zeros. The first column is the index of the first 0 in each run, and the second is the index of the first nonzero element after the run. (This indexing pattern matches, for example, how slicing works and how the range function works.)
import numpy as np

def zero_runs(a):
    # Create an array that is 1 where a is 0, and pad each end with an extra 0.
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    # Runs start and end where absdiff is 1.
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges
For example:
In [236]: a = [1, 2, 3, 0, 0, 0, 0, 0, 0, 4, 5, 6, 0, 0, 0, 0, 9, 8, 7, 0, 10, 11]
In [237]: runs = zero_runs(a)
In [238]: runs
Out[238]:
array([[ 3, 9],
[12, 16],
[19, 20]])
With this format, it is simple to get the number of zeros in each run:
In [239]: runs[:,1] - runs[:,0]
Out[239]: array([6, 4, 1])
It's always a good idea to check the edge cases:
In [240]: zero_runs([0,1,2])
Out[240]: array([[0, 1]])
In [241]: zero_runs([1,2,0])
Out[241]: array([[2, 3]])
In [242]: zero_runs([1,2,3])
Out[242]: array([], shape=(0, 2), dtype=int64)
In [243]: zero_runs([0,0,0])
Out[243]: array([[0, 3]])
You can use itertools to achieve your expected result.
from itertools import groupby

a = [1, 2, 3, 0, 0, 0, 0, 0, 0, 4, 5, 6, 0, 0, 0, 0, 9, 8, 7, 0, 10, 11]
b = range(len(a))
for group in groupby(iter(b), lambda x: a[x]):
    if group[0] == 0:
        lis = list(group[1])
        print([min(lis), max(lis)])
Here is a custom function; not sure it's the most efficient, but it works:
def getZeroIndexes(li):
    begin = 0
    end = 0
    indexes = []
    zero = False
    for ind, elt in enumerate(li):
        if not elt and not zero:
            begin = ind
            zero = True
        if not elt and zero:
            end = ind
        if elt and zero:
            zero = False
            if begin == end:
                indexes.append(begin)
            else:
                indexes.append((begin, end))
    # flush a run of zeros that reaches the end of the list
    if zero:
        if begin == end:
            indexes.append(begin)
        else:
            indexes.append((begin, end))
    return indexes

how to use the steady-state probabilities to select a state in each iteration of the Python code?

I have an ergodic Markov chain with three states. I calculated the steady-state probabilities.
The states represent the inputs of my problem.
I want to solve my problem for n iterations, where in each one we select the input based on the calculated steady-state probabilities.
In other words, this is the same as having three options, each with a specific probability, and wanting to select one of them randomly in each iteration.
Do you have any suggestions?
Best,
Aissan
Let's assume you have a vector of probabilities (instead of just 3) and also that your initial state is the first one.
import random

def markov(probs, iters):
    # normalize the probabilities
    total = sum(probs)
    probs = [float(e) / total for e in probs]
    # determine the number of states
    n = len(probs)
    # set the initial state
    s = 0
    for i in range(iters):
        thresh = random.random()
        buildup = 0
        # when the running sum of probabilities exceeds `thresh`,
        # we've found the next state
        for j in range(n):
            buildup += probs[j]
            if buildup >= thresh:
                break
        # set the new state
        s = j
    return s
And thus
>>> markov([1,1,1], 100)
2
>>> markov([1,1,1], 100)
1
But this only returns the last state. It's easy to fix this with a neat trick, though. Let's turn this into a generator. We literally just need one more line, yield s.
def markov(probs, iters):
    # ...
    for i in range(iters):
        # Yield the current state
        yield s
        # ...
        for j in range(n):
            # ...
Now when we call markov we don't get an immediate response.
>>> g = markov([1,1,1], 100)
>>> g
<generator object markov at 0x10fce3280>
Instead we get a generator object which is kind of like a "frozen" loop. You can step it once with next
>>> next(g)
1
>>> next(g)
1
>>> next(g)
2
Or even unwind the whole thing with list
>>> list(markov([1,1,1], 100))
[0, 0, 1, 1, 0, 0, 0, 2, 1, 1, 2, 0, 1, 0, 0, 1, 2, 2, 2, 1, 2, 0, 1, 2, 0, 1, 2, 2, 2, 2, 1, 0, 0, 0, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 0, 0, 2, 2, 1, 0, 0, 0, 2, 0, 2, 2, 1, 0, 1, 1, 1, 2, 2, 2, 2, 0, 2, 1, 0, 0, 1, 2, 0, 0, 1, 2, 2, 0, 0, 1, 2, 1, 0, 0, 1, 0, 2, 1, 1, 0, 1, 1, 2, 2, 2, 1, 1, 0, 0, 0]
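As an aside, if all you need is a sequence of states drawn independently from a fixed steady-state distribution (rather than stepping a chain), NumPy can draw them all at once with np.random.choice and its `p` parameter:

```python
import numpy as np

probs = np.array([1.0, 1.0, 1.0])
probs /= probs.sum()   # normalize to a probability distribution

# draw 100 states, weighted by the steady-state probabilities
states = np.random.choice(len(probs), size=100, p=probs)
print(states[:10])
```

This replaces the hand-rolled threshold loop above with a single vectorized call; the outputs are random, so no fixed result is shown.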

Compare multiple columns in numpy array

I have a 2D numpy array with about 12 columns and 1000+ rows, and each cell contains a number from 1 to 5. I'm searching for the best sextuple of columns according to my point system, where 1 and 2 generate -1 point and 4 and 5 give +1.
If a row in a certain sextuple contains, for example, [1, 4, 5, 3, 4, 3], the score for this row should be +2, because 3*1 + 1*(-1) = 2. The next row may be [1, 2, 2, 3, 3, 3] and should score -3 points.
At first I tried a straightforward loop solution, but I realized there are 924 possible combinations of 6 columns out of 12 (665,280 if order matters), and when I also need to search for the best quintuple, quadruple, etc., the loop takes forever.
Is there perhaps a smarter numpy way of solving my problem?
import numpy as np
import itertools

N_rows = 10
arr = np.random.randint(1, 6, size=(N_rows, 12))
x = np.array([0, -1, -1, 0, 1, 1])
y = x[arr]
print(y)

score, best_sextuple = max((y[:, cols].sum(), cols)
                           for cols in itertools.combinations(range(12), 6))
print('''\
score: {s}
sextuple: {c}
'''.format(s=score, c=best_sextuple))
yields, for example,
score: 6
sextuple: (0, 1, 5, 8, 10, 11)
Explanation:
First, let's generate a random example, with 12 columns and 10 rows:
N_rows = 10
arr = np.random.randint(1, 6, size=(N_rows, 12))
Now we can use numpy indexing to convert the numbers in arr 1,2,...,5 to the values -1,0,1 (according to your scoring system):
x = np.array([0,-1,-1,0,1,1])
y = x[arr]
Next, let's use itertools.combinations to generate all possible combinations of 6 columns:
for cols in itertools.combinations(range(12),6)
and
y[:,cols].sum()
then gives the score for cols, a choice of columns (sextuple).
Finally, use max to pick off the sextuple with the best score:
score, best_sextuple = max((y[:, cols].sum(), cols)
                           for cols in itertools.combinations(range(12), 6))
import numpy

A = numpy.random.randint(1, 6, size=(1000, 12))
points = -1*(A == 1) + -1*(A == 2) + 1*(A == 4) + 1*(A == 5)
columnsums = numpy.sum(points, 0)

def best6(row):
    return numpy.argsort(row)[-6:]

bestcolumns = best6(columnsums)
allbestcolumns = list(map(best6, points))
bestcolumns will now contain the best 6 columns in ascending order. By similar logic, allbestcolumns will contain the best six columns in each row.
Extending on unutbu's longer answer above, it's possible to generate the mapped array of scores automatically. Since the score for each value is the same on every pass through the loop, the scores only need to be calculated once. Here's a slightly inelegant way to do it on an example 6x10 array, before and after the scores are applied.
>>> import numpy
>>> values = numpy.random.randint(6, size=(6,10))
>>> values
array([[4, 5, 1, 2, 1, 4, 0, 1, 0, 4],
[2, 5, 2, 2, 3, 1, 3, 5, 3, 1],
[3, 3, 5, 4, 2, 1, 4, 0, 0, 1],
[2, 4, 0, 0, 4, 1, 4, 0, 1, 0],
[0, 4, 1, 2, 0, 3, 3, 5, 0, 1],
[2, 3, 3, 4, 0, 1, 1, 1, 3, 2]])
>>> b = values.copy()
>>> b[ b<3 ] = -1
>>> b[ b==3 ] = 0
>>> b[ b>3 ] = 1
>>> b
array([[ 1, 1, -1, -1, -1, 1, -1, -1, -1, 1],
[-1, 1, -1, -1, 0, -1, 0, 1, 0, -1],
[ 0, 0, 1, 1, -1, -1, 1, -1, -1, -1],
[-1, 1, -1, -1, 1, -1, 1, -1, -1, -1],
[-1, 1, -1, -1, -1, 0, 0, 1, -1, -1],
[-1, 0, 0, 1, -1, -1, -1, -1, 0, -1]])
Incidentally, this thread claims that creating the combinations directly within numpy will yield around 5x faster performance than itertools, though perhaps at the expense of some readability.
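In that spirit, one way to take the Python-level loop out of the scoring is to materialize all 924 sextuples as a single index array and score every combination in one vectorized sum (a sketch reusing the arr/x/y names from the answer above; the combination list itself still comes from itertools):

```python
import itertools
import numpy as np

N_rows = 10
arr = np.random.randint(1, 6, size=(N_rows, 12))
x = np.array([0, -1, -1, 0, 1, 1])
y = x[arr]

# all C(12, 6) = 924 column sextuples as one (924, 6) index array
combs = np.array(list(itertools.combinations(range(12), 6)))
# fancy indexing gives shape (N_rows, 924, 6); sum over rows and columns
scores = y[:, combs].sum(axis=(0, 2))
best_sextuple = tuple(combs[scores.argmax()])
print(scores.max(), best_sextuple)
```

The max over `scores` matches the itertools-generator version, but the per-combination sums run inside NumPy instead of a Python loop.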
