Numpy sum running length of non-zero values - python

Looking for a fast vectorized function that returns the rolling number of consecutive non-zero values. The count should start over at 0 whenever encountering a zero. The result should have the same shape as the input array.
Given an array like this:
x = np.array([2.3, 1.2, 4.1 , 0.0, 0.0, 5.3, 0, 1.2, 3.1])
The function should return this:
array([1, 2, 3, 0, 0, 1, 0, 1, 2])

This post describes a vectorized approach which basically consists of two steps:
Initialize a zeros vector of the same size as the input vector x, and set ones at the places corresponding to non-zeros of x.
Next, in that vector, put the negative run-length of each "island" of non-zeros right after that island's stop position. The intention is to use cumsum later on, which then yields sequential numbers within each "island" and zeros elsewhere.
Here's the implementation -
import numpy as np
#Append zeros at the start and end of input array, x
xa = np.hstack([[0],x,[0]])
# Get an array of ones and zeros, with ones for nonzeros of x and zeros elsewhere
xa1 =(xa!=0)+0
# Find consecutive differences on xa1
xadf = np.diff(xa1)
# Find start and stop+1 indices and thus the lengths of "islands" of non-zeros
starts = np.where(xadf==1)[0]
stops_p1 = np.where(xadf==-1)[0]
lens = stops_p1 - starts
# Mark indices where "minus ones" are to be put for applying cumsum
put_m1 = stops_p1[stops_p1 < x.size]
# Setup vector with ones for nonzero x's, "minus lens" at stops +1 & zeros elsewhere
vec = xa1[1:-1] # Note: this will change xa1, but it's okay as not needed anymore
vec[put_m1] = -lens[0:put_m1.size]
# Perform cumsum to get the desired output
out = vec.cumsum()
Sample run -
In [116]: x
Out[116]: array([ 0. , 2.3, 1.2, 4.1, 0. , 0. , 5.3, 0. , 1.2, 3.1, 0. ])
In [117]: out
Out[117]: array([0, 1, 2, 3, 0, 0, 1, 0, 1, 2, 0], dtype=int32)
Runtime tests -
Here are some runtime tests comparing the proposed approach against the itertools.groupby-based approach -
In [21]: N = 1000000
...: x = np.random.rand(1,N)
...: x[x>0.5] = 0.0
...: x = x.ravel()
...:
In [19]: %timeit sumrunlen_vectorized(x)
10 loops, best of 3: 19.9 ms per loop
In [20]: %timeit sumrunlen_loopy(x)
1 loops, best of 3: 2.86 s per loop

You can use itertools.groupby and np.hstack:
>>> import numpy as np
>>> x = np.array([2.3, 1.2, 4.1 , 0.0, 0.0, 5.3, 0, 1.2, 3.1])
>>> from itertools import groupby
>>> np.hstack([[i if j!=0 else j for i,j in enumerate(g,1)] for _,g in groupby(x,key=lambda x: x!=0)])
array([ 1., 2., 3., 0., 0., 1., 0., 1., 2.])
We group the array elements on whether they are non-zero, then use a list comprehension with enumerate to replace the elements of each non-zero group with their running count (1, 2, 3, ...), and finally flatten the result with np.hstack.
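If the one-liner feels dense, here is an equivalent, more explicit version (a sketch of the same idea, assuming the same x as above):
import numpy as np
from itertools import groupby

out = []
for is_nonzero, group in groupby(x, key=lambda v: v != 0):
    run = list(group)
    if is_nonzero:
        # running count 1, 2, 3, ... within a non-zero run
        out.extend(range(1, len(run) + 1))
    else:
        out.extend([0] * len(run))
result = np.array(out)  # array([1, 2, 3, 0, 0, 1, 0, 1, 2])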

This sub-problem came up in Kick Start 2021 Round A for me. My solution:
def current_run_len(a):
    a_ = np.hstack([0, a != 0, 0])  # first in starts and last in stops defined
    d = np.diff(a_)
    starts = np.where(d == 1)[0]
    stops = np.where(d == -1)[0]
    a_[stops + 1] = -(stops - starts)  # +1 for behind-last
    return a_[1:-1].cumsum()
In fact, the problem also required a version that counts down within each consecutive sequence. Here is a variant with an optional keyword argument; with rev=False it behaves exactly as above:
def current_run_len(a, rev=False):
    a_ = np.hstack([0, a != 0, 0])  # first in starts and last in stops defined
    d = np.diff(a_)
    starts = np.where(d == 1)[0]
    stops = np.where(d == -1)[0]
    if rev:
        a_[starts] = -(stops - starts)
        cs = -a_.cumsum()[:-2]
    else:
        a_[stops + 1] = -(stops - starts)  # +1 for behind-last
        cs = a_.cumsum()[1:-1]
    return cs
Results:
a = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1])
print('a = ', a)
print('current_run_len(a) = ', current_run_len(a))
print('current_run_len(a, rev=True) = ', current_run_len(a, rev=True))
a = [1 1 1 1 0 0 0 1 1 0 1 0 0 0 1]
current_run_len(a) = [1 2 3 4 0 0 0 1 2 0 1 0 0 0 1]
current_run_len(a, rev=True) = [4 3 2 1 0 0 0 2 1 0 1 0 0 0 1]
For an array that consists of 0s and 1s only, you can simplify [0, a != 0, 0] to [0, a, 0]. But the version as-posted also works for arbitrary non-zero numbers.
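For reference, applying this to the array from the question at the top reproduces the expected output:
x = np.array([2.3, 1.2, 4.1, 0.0, 0.0, 5.3, 0, 1.2, 3.1])
print(current_run_len(x))            # [1 2 3 0 0 1 0 1 2]
print(current_run_len(x, rev=True))  # [3 2 1 0 0 1 0 2 1]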


Vectorizing calculation in matrix with interdependent values

I am tracking multiple discrete time-series at multiple temporal resolutions, resulting in an SxRxB matrix where S is the number of time-series, R is the number of different resolutions and B is the buffer, i.e. how many values each series remembers. Each series is discrete and uses a limited range of natural numbers to represent its values. I will call these "symbols" here.
For each series I want to calculate how often any of the previous measurement's symbols directly precedes any of the current measurement's symbols, over all measurements. I have solved this with a for-loop as seen below, but would like to vectorize it for obvious reasons.
I'm not sure if my way of structuring data is efficient, so I'm open for suggestions there. Especially the ratios matrix could be done differently I think.
Thanks in advance!
def supports_loop(data, num_series, resolutions, buffer_size, vocab_size):
    # For small test matrices we can calculate the complete matrix without problems
    indices = []
    indices.append(xrange(num_series))
    indices.append(xrange(vocab_size))
    indices.append(xrange(num_series))
    indices.append(xrange(vocab_size))
    indices.append(xrange(resolutions))
    # This is huge! :/
    # dimensions:
    #   series and value for which we calculate,
    #   series and value which precedes that measurement,
    #   resolution
    ratios = np.full((num_series, vocab_size, num_series, vocab_size, resolutions), 0.0)
    for idx in itertools.product(*indices):
        s0, v0 = idx[0], idx[1]  # the series and symbol for which we calculate
        s1, v1 = idx[2], idx[3]  # the series and symbol which should precede the one we're calculating for
        res = idx[4]
        # Find the positions where s0==v0
        found0 = np.where(data[s0, res, :] == v0)[0]
        if found0.size == 0:
            continue
        #print('found {}={} at {}'.format(s0, v0, found0))
        # Check how often s1==v1 right before s0==v0
        candidates = (s1, res, (found0 - 1 + buffer_size) % buffer_size)
        found01 = np.count_nonzero(data[candidates] == v1)
        if found01 == 0:
            continue
        print('found {}={} following {}={} at {}'.format(s0, v0, s1, v1, found01))
        # total01 = number of positions where either s0 or s1 is defined (i.e. >=0)
        total01 = len(np.argwhere((data[s0, res, :] >= 0) & (data[s1, res, :] >= 0)))
        ratio = (float(found01) / total01) if total01 > 0 else 0.0
        ratios[idx] = ratio
    return ratios
def stackoverflow_example(fnc):
    data = np.array([
        [[0, 0, 1],   # series 0, resolution 0
         [1, 3, 2]],  # series 0, resolution 1
        [[2, 1, 2],   # series 1, resolution 0
         [3, 3, 3]],  # series 1, resolution 1
    ])
    num_series = data.shape[0]
    resolutions = data.shape[1]
    buffer_size = data.shape[2]
    vocab_size = np.max(data)+1
    ratios = fnc(data, num_series, resolutions, buffer_size, vocab_size)
    coordinates = np.argwhere(ratios > 0.0)
    nz_values = ratios[ratios > 0.0]
    print(np.hstack((coordinates, nz_values[:,None])))
    print('0/0 precedes 0/0 in 1 out of 3 cases: {}'.format(np.isclose(ratios[0,0,0,0,0], 1.0/3.0)))
    print('1/2 precedes 0/0 in 2 out of 3 cases: {}'.format(np.isclose(ratios[0,0,1,2,0], 2.0/3.0)))
Expected output (21 pairs, 5 columns for coordinates, followed by found count):
[[0 0 0 0 0 1]
[0 0 0 1 0 1]
[0 0 1 2 0 2]
[0 1 0 0 0 1]
[0 1 0 2 1 1]
[0 1 1 1 0 1]
[0 1 1 3 1 1]
[0 2 0 3 1 1]
[0 2 1 3 1 1]
[0 3 0 1 1 1]
[0 3 1 3 1 1]
[1 1 0 0 0 1]
[1 1 1 2 0 1]
[1 2 0 0 0 1]
[1 2 0 1 0 1]
[1 2 1 1 0 1]
[1 2 1 2 0 1]
[1 3 0 1 1 1]
[1 3 0 2 1 1]
[1 3 0 3 1 1]
[1 3 1 3 1 3]]
In the example above the 0 in series 0 follows a 2 in series 1 in two out of three cases (since the buffers are circular), so the ratio at [0, 0, 1, 2, 0] will be ~0.6666. Also series 0, value 0 follows itself in one out of three cases, so the ratio at [0, 0, 0, 0, 0] will be ~0.3333. There are some others which are >0.0 as well.
I am testing each answer on two datasets: a tiny one (as shown above) and a more realistic one (100 series, 5 resolutions, 10 values per series, 50 symbols).
Results
Answer      Time (tiny)   Time (huge)   All pairs found (tiny=21)
-----------------------------------------------------------------
Baseline    ~1ms          ~675s (!)     Yes
Saedeas     ~0.13ms       ~1.4ms        No (!)
Saedeas2    ~0.20ms       ~4.0ms        Yes, +cross resolutions
Elliot_1    ~0.70ms       ~100s (!)     Yes
Elliot_2    ~1ms          ~21s (!)      Yes
Kuppern_1   ~0.39ms       ~2.4s (!)     Yes
Kuppern_2   ~0.18ms       ~28ms         Yes
Kuppern_3   ~0.19ms       ~24ms         Yes
David       ~0.21ms       ~27ms         Yes
Saedeas 2nd approach is the clear winner! Thank you so much, all of you :)
To start, you're doing yourself a bit of a disservice by not explicitly nesting the for loops. You wind up repeating a lot of effort and not saving anything in terms of memory. When the loop is nested, you can move some of the computations from one level to another and figure out which inner loops can be vectorized over.
def supports_5_loop(data, num_series, resolutions, buffer_size, vocab_size):
    ratios = np.full((num_series, vocab_size, num_series, vocab_size, resolutions), 0.0)
    for res in xrange(resolutions):
        for s0 in xrange(num_series):
            # Find the positions where s0==v0
            for v0 in np.unique(data[s0, res]):
                # only need to find indices once for each series and value
                found0 = np.where(data[s0, res, :] == v0)[0]
                for s1 in xrange(num_series):
                    # Check how often s1==v1 right before s0==v0
                    candidates = (s1, res, (found0 - 1 + buffer_size) % buffer_size)
                    total01 = np.logical_or(data[s0, res, :] >= 0, data[s1, res, :] >= 0).sum()
                    # can skip inner loops if there are no candidates
                    if total01 == 0:
                        continue
                    for v1 in xrange(vocab_size):
                        found01 = np.count_nonzero(data[candidates] == v1)
                        if found01 == 0:
                            continue
                        ratio = (float(found01) / total01)
                        ratios[(s0, v0, s1, v1, res)] = ratio
    return ratios
You'll see in the timings that the majority of the speed pickup comes from not duplicating effort.
Once you've made the nested structure, you can start looking at vectorizations and other optimizations.
def supports_4_loop(data, num_series, resolutions, buffer_size, vocab_size):
    # For small test matrices we can calculate the complete matrix without problems
    # This is huge! :/
    # dimensions:
    #   series and value for which we calculate,
    #   series and value which precedes that measurement,
    #   resolution
    ratios = np.full((num_series, vocab_size, num_series, vocab_size, resolutions), 0.0)
    for res in xrange(resolutions):
        for s0 in xrange(num_series):
            # find the counts where either s0 or s1 are present
            total01 = np.logical_or(data[s0, res] >= 0,
                                    data[:, res] >= 0).sum(axis=1)
            s1s = np.where(total01)[0]
            # Find the positions where s0==v0
            v0s, counts = np.unique(data[s0, res], return_counts=True)
            # sorting before searching will show gains as the datasets
            # get larger
            indarr = np.argsort(data[s0, res])
            i0 = 0
            for v0, count in itertools.izip(v0s, counts):
                found0 = indarr[i0:i0+count]
                i0 += count
                for s1 in s1s:
                    candidates = data[(s1, res, (found0 - 1) % buffer_size)]
                    # can replace the innermost loop with numpy functions
                    v1s, counts = np.unique(candidates, return_counts=True)
                    ratios[s0, v0, s1, v1s, res] = counts / total01[s1]
    return ratios
Unfortunately I could only really vectorize over the innermost loop, and that only bought an additional 10% speedup. Outside of the innermost loop you can't guarantee that all the vectors are the same size, so you can't build an array.
In [121]: (np.all(supports_loop(data, num_series, resolutions, buffer_size, vocab_size) == supports_5_loop(data, num_series, resolutions, buffer_size, vocab_size)))
Out[121]: True
In [122]: (np.all(supports_loop(data, num_series, resolutions, buffer_size, vocab_size) == supports_4_loop(data, num_series, resolutions, buffer_size, vocab_size)))
Out[122]: True
In [123]: %timeit(supports_loop(data, num_series, resolutions, buffer_size, vocab_size))
2.29 ms ± 73.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [124]: %timeit(supports_5_loop(data, num_series, resolutions, buffer_size, vocab_size))
949 µs ± 5.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [125]: %timeit(supports_4_loop(data, num_series, resolutions, buffer_size, vocab_size))
843 µs ± 3.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If I'm understanding your problem correctly, I think this bit of code will get you the symbol pairs you're looking for in a relatively quick, vectorized fashion.
import numpy as np
import time
from collections import Counter
series = 2
resolutions = 2
buffer_len = 3
symbols = range(3)
#mat = np.random.choice(symbols, size=(series, resolutions, buffer_len)).astype('uint8')
mat = np.array([
    [[0, 0, 1],   # series 0, resolution 0
     [1, 3, 2]],  # series 0, resolution 1
    [[2, 1, 2],   # series 1, resolution 0
     [3, 3, 3]],  # series 1, resolution 1
])
start = time.time()
index_mat = np.indices(mat.shape)
right_shift_indices = np.roll(index_mat, -1, axis=3)
mat_shifted = mat[right_shift_indices[0], right_shift_indices[1], right_shift_indices[2]]
# These construct all the pairs directly
first_series = np.repeat(range(series), series*resolutions*buffer_len)
second_series = np.tile(np.repeat(range(series), resolutions*buffer_len), series)
res_loop = np.tile(np.repeat(range(resolutions), buffer_len), series*series)
mat_unroll = np.repeat(mat, series, axis=0)
shift_unroll = np.tile(mat_shifted, series)
# Constructs the pairs
pairs = zip(np.ravel(first_series),
np.ravel(second_series),
np.ravel(res_loop),
np.ravel(mat_unroll),
np.ravel(shift_unroll))
pair_time = time.time() - start
results = Counter(pairs)
end = time.time() - start
print("Mat: {}").format(mat)
print("Pairs: {}").format(results)
print("Number of Pairs: {}".format(len(pairs)))
print("Pair time is: {}".format(pair_time))
print("Count time is: {}".format(end-pair_time))
print("Total time is: {}".format(end))
The basic idea was to circularly shift each buffer by the appropriate amount depending on which time series it was (I think this is what your current code was doing). I can then generate all the symbol pairs by simply zipping lists offset by 1 together along the series axis.
Example output:
Mat: [[[0 0 1]
[1 3 2]]
[[2 1 2]
[3 3 3]]]
Pairs: Counter({(1, 1, 1, 3, 3): 3, (1, 0, 0, 2, 0): 2, (0, 0, 0, 0, 0): 1, (1, 1, 0, 2, 2): 1, (1, 1, 0, 2, 1): 1, (0, 1, 0, 0, 2): 1, (1, 0, 1, 3, 3): 1, (0, 0, 1, 1, 3): 1, (0, 0, 1, 3, 2): 1, (1, 0, 0, 1, 1): 1, (0, 1, 0, 0, 1): 1, (0, 1, 1, 2, 3): 1, (0, 1, 0, 1, 2): 1, (1, 1, 0, 1, 2): 1, (0, 1, 1, 3, 3): 1, (1, 0, 1, 3, 2): 1, (0, 0, 0, 0, 1): 1, (0, 1, 1, 1, 3): 1, (0, 0, 1, 2, 1): 1, (0, 0, 0, 1, 0): 1, (1, 0, 1, 3, 1): 1})
Number of Pairs: 24
Pair time is: 0.000135183334351
Count time is: 5.10215759277e-05
Total time is: 0.000186204910278
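To see the circular-shift idea on a single buffer in isolation (a small illustration, not part of the timing above): rolling a buffer by one position and zipping it with the original yields every (previous, current) pair, wrapping around the end of the circular buffer.
import numpy as np

buf = np.array([0, 0, 1])     # e.g. series 0, resolution 0
prev = np.roll(buf, 1)        # [1, 0, 0] -- circular "previous value" for each slot
pairs = list(zip(prev, buf))  # [(1, 0), (0, 0), (0, 1)]
The code above does the same thing for every ordered pair of series at once via the repeated/tiled index arrays.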
Edit: True final attempt. Fully vectorized.
A trick that makes this vectorizable is to build an array comb[i] = buffer1[i] + buffer2[i-1]*voc_size for each pair of series. Each combination then gets a unique value in the array, and the combination can be recovered with v1[i] = comb[i] % voc_size, v2[i] = comb[i] // voc_size. As long as the number of series is not very high (< 10000, I think) there is no point in any further vectorization.
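As a quick sanity check of the encoding (a small illustration with a hypothetical voc_size of 4; it works because every symbol is strictly smaller than voc_size):
voc_size = 4                            # hypothetical vocabulary size
current, previous = 3, 2
comb = current + previous * voc_size    # 3 + 2*4 = 11
assert comb % voc_size == current       # 11 % 4 == 3
assert comb // voc_size == previous     # 11 // 4 == 2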
def support_vectorized(data, num_series, resolutions, buffer_size, vocab_size):
    ratios = np.zeros((num_series, vocab_size, num_series, vocab_size, resolutions))
    prev = np.roll(data, 1, axis=2)  # Get previous values
    prev *= vocab_size               # To separate prev from data
    for i, series in enumerate(data):
        for j, prev_series in enumerate(prev):
            comb = series + prev_series
            for k, buffer in enumerate(comb):
                idx, counts = np.unique(buffer, return_counts=True)
                v = idx % vocab_size
                v2 = idx // vocab_size
                ratios[i, v, j, v2, k] = counts/buffer_size
    return ratios
If however S or R is large, a full vectorization is possible but this uses a lot of memory:
def row_unique(comb):
    comb.sort(axis=-1)
    changes = np.concatenate((
        np.ones((comb.shape[0], comb.shape[1], comb.shape[2], 1), dtype="bool"),
        comb[:, :, :, 1:] != comb[:, :, :, :-1]), axis=-1)
    vals = comb[changes]
    idxs = np.nonzero(changes)
    tmp = np.hstack((idxs[-1], 0))
    counts = np.where(tmp[1:], np.diff(tmp), comb.shape[-1]-tmp[:-1])
    return idxs, vals, counts

def supports_full_vectorized(data, num_series, resolutions, buffer_size, vocab_size):
    ratios = np.zeros((num_series, vocab_size, num_series, vocab_size, resolutions))
    prev = np.roll(data, 1, axis=2)*vocab_size
    comb = data + prev[:, None]            # Create every combination
    idxs, vals, counts = row_unique(comb)  # Get unique values and counts for each row
    ratios[idxs[1], vals % vocab_size, idxs[0], vals // vocab_size, idxs[2]] = counts/buffer_size
    return ratios
However, for S=100 this is slower than the previous solution. A middle ground is to keep a for loop over the series to reduce the memory usage:
def row_unique2(comb):
    comb.sort(axis=-1)
    changes = np.concatenate((
        np.ones((comb.shape[0], comb.shape[1], 1), dtype="bool"),
        comb[:, :, 1:] != comb[:, :, :-1]), axis=-1)
    vals = comb[changes]
    idxs = np.nonzero(changes)
    tmp = np.hstack((idxs[-1], 0))
    counts = np.where(tmp[1:], np.diff(tmp), comb.shape[-1]-tmp[:-1])
    return idxs, vals, counts

def supports_half_vectorized(data, num_series, resolutions, buffer_size, vocab_size):
    prev = np.roll(data, 1, axis=2)*vocab_size
    ratios = np.zeros((num_series, vocab_size, num_series, vocab_size, resolutions))
    for i, series in enumerate(data):
        comb = series + prev
        idxs, vals, counts = row_unique2(comb)
        ratios[i, vals % vocab_size, idxs[0], vals // vocab_size, idxs[1]] = counts/buffer_size
    return ratios
The running times for the different solutions show that supports_half_vectorized is the fastest:
In [41]: S, R, B, voc_size = (100, 5, 1000, 29)
In [42]: data = np.random.randint(voc_size, size=S*R*B).reshape((S, R, B))
In [43]: %timeit support_vectorized(data, S, R, B, voc_size)
1 loop, best of 3: 4.84 s per loop
In [44]: %timeit supports_full_vectorized(data, S, R, B, voc_size)
1 loop, best of 3: 5.3 s per loop
In [45]: %timeit supports_half_vectorized(data, S, R, B, voc_size)
1 loop, best of 3: 4.36 s per loop
In [46]: %timeit supports_4_loop(data, S, R, B, voc_size)
1 loop, best of 3: 36.7 s per loop
So this is kind of a cop-out answer, but I've been working with @Saedeas's answer and, based on timings on my machine, have been able to optimize it slightly. I do believe there is a way to do this without the loop, but the size of the intermediate array may be prohibitive.
The change I have made is to remove the concatenation that happens at the end of the run() function. This was creating a new array and is unnecessary. Instead we create the full-size array at the beginning and just don't use the last row until the end.
Another change is that the tiling of single was slightly inefficient. I have replaced it with very slightly faster code.
I do believe this can be made faster, but it would take some work. I was testing with larger sizes, so please let me know what timings you get on your machine.
Code is below:
import numpy as np
import logging
import sys
import time
import itertools
import timeit

logging.basicConfig(stream=sys.stdout,
                    level=logging.DEBUG,
                    format='%(message)s')

def run():
    series = 2
    resolutions = 2
    buffer_len = 3
    symbols = range(50)
    #mat = np.random.choice(symbols, size=(series, resolutions, buffer_len))
    mat = np.array([
        [[0, 0, 1],   # series 0, resolution 0
         [1, 3, 2]],  # series 0, resolution 1
        [[2, 1, 2],   # series 1, resolution 0
         [3, 3, 3]],  # series 1, resolution 1
        # [[4, 5, 6, 10],
        #  [7, 8, 9, 11]],
    ])
    # logging.debug("Original:")
    # logging.debug(mat)
    start = time.time()
    index_mat = np.indices((series, resolutions, buffer_len))
    # This loop shifts all series but the one being looked at, and zips the
    # element being looked at with every other member of that row
    cross_pairs = np.empty((series, resolutions, buffer_len, series, 2), int)
    #cross_pairs = []
    right_shift_indices = [index_mat[0], index_mat[1], (index_mat[2] - 1) % buffer_len]
    for i in range(series):
        right_shift_indices[2][i] = (right_shift_indices[2][i] + 1) % buffer_len
        # create a new matrix from the modified indices
        mat_shifted = mat[right_shift_indices]
        mat_shifted_t = mat_shifted.T.reshape(-1, series)
        single = mat_shifted_t[:, i]
        #print np.tile(single,(series-1,1)).T
        #print single.reshape(-1,1).repeat(series-1,1)
        #print single.repeat(series-1).reshape(-1,series-1)
        mat_shifted_t = np.delete(mat_shifted_t, i, axis=1)
        #cross_pairs[i,:,:,:-1] = (np.dstack((np.tile(single, (mat_shifted_t.shape[1], 1)).T, mat_shifted_t))).reshape(resolutions, buffer_len, (series-1), 2, order='F')
        #cross_pairs[i,:,:,:-1] = (np.dstack((single.reshape(-1,1).repeat(series-1,1), mat_shifted_t))).reshape(resolutions, buffer_len, (series-1), 2, order='F')
        cross_pairs[i,:,:,:-1] = np.dstack((single.repeat(series-1).reshape(-1,series-1), mat_shifted_t)).reshape(resolutions, buffer_len, (series-1), 2, order='F')
        right_shift_indices[2][i] = (right_shift_indices[2][i] - 1) % buffer_len
        #cross_pairs.extend([zip(itertools.repeat(x[i]), np.append(x[:i], x[i+1:])) for x in mat_shifted_t])
    #consecutive_pairs = np.empty((series, resolutions, buffer_len, 2, 2), int)
    #print "1", consecutive_pairs.shape
    # tedious code to put this stuff in the right shape
    in_series_zips = np.stack([mat[:, :, :-1], mat[:, :, 1:]], axis=3)
    circular_in_series_zips = np.stack([mat[:, :, -1], mat[:, :, 0]], axis=2)
    # This creates the final array.
    # Index 0 is the preceding series
    # Index 1 is the resolution
    # Index 2 is the location in the buffer
    # Index 3 is for the first n-1 elements, the following series, and for the last element
    #   it's the next element of the Index 0 series
    # Index 4 is the index into the two element pair
    cross_pairs[:,:,:-1,-1] = in_series_zips
    cross_pairs[:,:,-1,-1] = circular_in_series_zips
    end = time.time()
    #logging.debug("Pairs encountered:")
    #logging.debug(pairs)
    logging.info("Elapsed: {}".format(end - start))

if __name__ == '__main__':
    run()

Shift rows of a numpy array independently

This is an extension of the question posed here (quoted below)
I have a matrix (2d numpy ndarray, to be precise):
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
And I want to roll each row of A independently, according to roll
values in another array:
r = np.array([2, 0, -1])
That is, I want to do this:
print np.array([np.roll(row, x) for row,x in zip(A, r)])
[[0 0 4]
[1 2 3]
[0 5 0]]
Is there a way to do this efficiently? Perhaps using fancy indexing
tricks?
The accepted solution was:
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
# Use always a negative shift, so that column_indices are valid.
# (could also use modulo operation)
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:,np.newaxis]
result = A[rows, column_indices]
I would basically like to do the same thing, except that when an index gets rolled "past" the end of the row, I would like the vacated positions to be padded with NaN, rather than having the values wrap around to the other side of the row in a periodic fashion.
Maybe using np.pad somehow? But I can't figure out how to get that to pad different rows by different amounts.
Inspired by the solution to Roll rows of a matrix independently, here's a vectorized one based on np.lib.stride_tricks.as_strided -
from skimage.util.shape import view_as_windows as viewW

def strided_indexing_roll(a, r):
    # Concatenate with sliced to cover all rolls
    p = np.full((a.shape[0], a.shape[1]-1), np.nan)
    a_ext = np.concatenate((p, a, p), axis=1)
    # Get sliding windows; use advanced-indexing to select appropriate ones
    n = a.shape[1]
    return viewW(a_ext, (1, n))[np.arange(len(r)), -r + (n-1), 0]
Sample run -
In [76]: a
Out[76]:
array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
In [77]: r
Out[77]: array([ 2, 0, -1])
In [78]: strided_indexing_roll(a, r)
Out[78]:
array([[nan, nan, 4.],
[ 1., 2., 3.],
[ 0., 5., nan]])
I was able to hack this together with linear indexing...it gets the right result but performs rather slowly on large arrays.
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]]).astype(float)
r = np.array([2, 0, -1])
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
# Use always a negative shift, so that column_indices are valid.
# (could also use modulo operation)
r_old = r.copy()
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:,np.newaxis]
result = A[rows, column_indices]
# replace with NaNs
row_length = result.shape[-1]
pad_inds = []
for ind, i in enumerate(r_old):
    if i > 0:
        inds2pad = [np.ravel_multi_index((ind,) + (j,), result.shape) for j in range(i)]
        pad_inds.extend(inds2pad)
    if i < 0:
        inds2pad = [np.ravel_multi_index((ind,) + (j,), result.shape) for j in range(row_length+i, row_length)]
        pad_inds.extend(inds2pad)
result.ravel()[pad_inds] = np.nan
Gives the expected result:
print result
[[ nan nan 4.]
[ 1. 2. 3.]
[ 0. 5. nan]]
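For sanity-checking results on small inputs, a straightforward per-row reference (a sketch, assuming a float array so NaN can be stored) is to roll each row with np.roll and then overwrite the positions that wrapped around:
import numpy as np

def roll_rows_nan(A, r):
    # Roll each row of A by r[i]; positions that wrapped around become NaN.
    out = np.empty_like(A, dtype=float)
    for i, shift in enumerate(r):
        out[i] = np.roll(A[i], shift)
        if shift > 0:
            out[i, :shift] = np.nan   # these values came from the end of the row
        elif shift < 0:
            out[i, shift:] = np.nan   # these values came from the start of the row
    return out

# roll_rows_nan(A, r) -> [[nan, nan, 4.], [1., 2., 3.], [0., 5., nan]]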
Based on @Seberg's and @yann-dubois's answers for the non-nan case, I've written a method that:
Is faster than the current answer
Works on ndarrays of any shape (specify the row-axis using the axis argument)
Allows for setting fill to either np.nan, any other "fill value" or False to allow regular rolling across the array edge.
Benchmarking
cols, rows = 1024, 2048
arr = np.stack(rows*(np.arange(cols,dtype=float),))
shifts = np.random.randint(-cols, cols, rows)
np.testing.assert_array_almost_equal(row_roll(arr, shifts), strided_indexing_roll(arr, shifts))
# True
%timeit row_roll(arr, shifts)
# 25.9 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit strided_indexing_roll(arr, shifts)
# 29.7 ms ± 446 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
def row_roll(arr, shifts, axis=1, fill=np.nan):
    """Apply an independent roll for each dimension of a single axis.

    Parameters
    ----------
    arr : np.ndarray
        Array of any shape.
    shifts : np.ndarray, dtype int. Shape: `(arr.shape[:axis],)`.
        Amount to roll each row by. Positive shifts row right.
    axis : int
        Axis along which elements are shifted.
    fill : bool or float
        If True, value to be filled at missing values. Otherwise just rolls across edges.
    """
    if np.issubdtype(arr.dtype, int) and isinstance(fill, float):
        arr = arr.astype(float)

    shifts2 = shifts.copy()
    arr = np.swapaxes(arr, axis, -1)
    all_idcs = np.ogrid[[slice(0, n) for n in arr.shape]]
    # Convert to a positive shift
    shifts2[shifts2 < 0] += arr.shape[-1]
    all_idcs[-1] = all_idcs[-1] - shifts2[:, np.newaxis]

    result = arr[tuple(all_idcs)]

    if fill is not False:
        # Create mask of row positions above negative shifts
        # or below positive shifts. Then set them to np.nan.
        *_, nrows, ncols = arr.shape

        mask_neg = shifts < 0
        mask_pos = shifts >= 0

        shifts_pos = shifts.copy()
        shifts_pos[mask_neg] = 0
        shifts_neg = shifts.copy()
        shifts_neg[mask_pos] = ncols + 1  # need to be bigger than the biggest positive shift
        shifts_neg[mask_neg] = shifts[mask_neg] % ncols

        indices = np.stack(nrows * (np.arange(ncols),))
        nanmask = (indices < shifts_pos[:, None]) | (indices >= shifts_neg[:, None])
        result[nanmask] = fill

    arr = np.swapaxes(result, -1, axis)
    return arr

Array of indices of unique values

I start with an array a containing N unique values (product(a.shape) >= N).
I need to find the array b that holds, at the position of each element of a, that element's index (0 .. N-1) in the sorted list of unique values of a.
As an example
import numpy as np
np.random.seed(42)
a = np.random.choice([0.1,1.3,7,9.4], size=(4,3))
print a
prints a as
[[ 7. 9.4 0.1]
[ 7. 7. 9.4]
[ 0.1 0.1 7. ]
[ 1.3 7. 7. ]]
The unique values are [0.1, 1.3, 7.0, 9.4], so the required outcome b would be
[[2 3 0]
[2 2 3]
[0 0 2]
[1 2 2]]
(e.g. the value at a[0,0] is 7.; 7. has the index 2; thus b[0,0] == 2.)
Since numpy does not have an index function,
I could do this using a loop. Either looping over the input array, like this:
u = np.unique(a).tolist()
af = a.flatten()
b = np.empty(len(af), dtype=int)
for i in range(len(af)):
    b[i] = u.index(af[i])
b = b.reshape(a.shape)
print b
or looping over the unique values as follows:
u = np.unique(a)
b = np.empty(a.shape, dtype=int)
for i in range(len(u)):
    b[np.where(a == u[i])] = i
print b
I suppose that the second way of looping over the unique values is already more efficient than the first in cases where not all values in a are distinct; but still, it involves this loop and is rather inefficient compared to inplace operations.
So my question is: What is the most efficient way of obtaining the array b filled with the indices of the unique values of a?
You could use np.unique with its optional argument return_inverse -
np.unique(a, return_inverse=1)[1].reshape(a.shape)
Sample run -
In [308]: a
Out[308]:
array([[ 7. , 9.4, 0.1],
[ 7. , 7. , 9.4],
[ 0.1, 0.1, 7. ],
[ 1.3, 7. , 7. ]])
In [309]: np.unique(a, return_inverse=1)[1].reshape(a.shape)
Out[309]:
array([[2, 3, 0],
[2, 2, 3],
[0, 0, 2],
[1, 2, 2]])
Going through the source code of np.unique, it looks pretty efficient to me, but pruning out the unnecessary parts we end up with another solution, like so -
def unique_return_inverse(a):
    ar = a.flatten()
    perm = ar.argsort()
    aux = ar[perm]
    flag = np.concatenate(([True], aux[1:] != aux[:-1]))
    iflag = np.cumsum(flag) - 1
    inv_idx = np.empty(ar.shape, dtype=np.intp)
    inv_idx[perm] = iflag
    return inv_idx
Timings -
In [444]: a= np.random.randint(0,1000,(1000,400))
In [445]: np.allclose( np.unique(a, return_inverse=1)[1],unique_return_inverse(a))
Out[445]: True
In [446]: %timeit np.unique(a, return_inverse=1)[1]
10 loops, best of 3: 30.4 ms per loop
In [447]: %timeit unique_return_inverse(a)
10 loops, best of 3: 29.5 ms per loop
Not a great deal of improvement there over the built-in.
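Another option worth noting (an aside, not part of the timings above) is np.searchsorted against the sorted unique values; since np.unique returns a sorted array, the insertion index of each element is exactly its index among the unique values:
u = np.unique(a)           # sorted unique values
b = np.searchsorted(u, a)  # same shape as a; index of each element of a in u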

Count number of clusters of non-zero values in Python?

My data looks something like this:
a=[0,0,0,0,0,0,10,15,16,12,11,9,10,0,0,0,0,0,6,9,3,7,5,4,0,0,0,0,0,0,4,3,9,7,1]
Essentially, there's a bunch of zeroes before non-zero numbers and I am looking to count the number of groups of non-zero numbers separated by zeros. In the example data above, there are 3 groups of non-zero data so the code should return 3.
Number of zeros between groups of non-zeros is variable
Any good ways to do this in python? (Also using Pandas and Numpy to help parse the data)
With a as the input array, we could have a vectorized solution -
m = a!=0
out = (m[1:] > m[:-1]).sum() + m[0]
Alternatively for performance, we might use np.count_nonzero which is very efficient to count bools as is the case here, like so -
out = np.count_nonzero(m[1:] > m[:-1]) + m[0]
Basically, we get a mask of non-zeros and count rising edges. To account for the first element that could be non-zero too and would not have any rising edge, we need to check it and add to the total sum.
Also, please note that if input a is a list, we need to use m = np.asarray(a)!=0 instead.
Sample runs for three cases -
In [92]: a # Case1 :Given sample
Out[92]:
array([ 0, 0, 0, 0, 0, 0, 10, 15, 16, 12, 11, 9, 10, 0, 0, 0, 0,
0, 6, 9, 3, 7, 5, 4, 0, 0, 0, 0, 0, 0, 4, 3, 9, 7,
1])
In [93]: m = a!=0
In [94]: (m[1:] > m[:-1]).sum() + m[0]
Out[94]: 3
In [95]: a[0] = 7 # Case2 :Add a non-zero elem/group at the start
In [96]: m = a!=0
In [97]: (m[1:] > m[:-1]).sum() + m[0]
Out[97]: 4
In [99]: a[-2:] = [0,4] # Case3 :Add a non-zero group at the end
In [100]: m = a!=0
In [101]: (m[1:] > m[:-1]).sum() + m[0]
Out[101]: 5
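Wrapping this up into a small helper (a sketch that also handles plain-list input, per the note above):
import numpy as np

def count_nonzero_clusters(a):
    # Count groups of consecutive non-zero values by counting rising edges
    # of the boolean mask; np.asarray makes this work for plain lists too.
    m = np.asarray(a) != 0
    return int(np.count_nonzero(m[1:] > m[:-1]) + m[0])

# count_nonzero_clusters([0, 0, 10, 15, 0, 6, 0, 0, 4])  # -> 3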
You can achieve it using itertools.groupby() with a list comprehension, as:
>>> from itertools import groupby
>>> len([is_true for is_true, _ in groupby(a, lambda x: x!=0) if is_true])
3
Simple Python solution: just count changes from 0 to non-zero by keeping track of the previous value (rising edge detection):
a=[0,0,0,0,0,0,10,15,16,12,11,9,10,0,0,0,0,0,6,9,3,7,5,4,0,0,0,0,0,0,4,3,9,7,1]
previous = 0
count = 0
for c in a:
    if previous==0 and c!=0:
        count+=1
    previous = c
print(count) # 3
pad array with a zero on both sides with np.concatenate
find where zero with a == 0
find boundaries with np.diff
sum up boundaries found with sum
divide by two because we will have found twice as many as we want
def nonzero_clusters(a):
    return int(np.diff(np.concatenate([[0], a, [0]]) == 0).sum() / 2)
demonstration
nonzero_clusters(
[0,0,0,0,0,0,10,15,16,12,11,9,10,0,0,0,0,0,6,9,3,7,5,4,0,0,0,0,0,0,4,3,9,7,1]
)
3
nonzero_clusters([0, 1, 2, 0, 1, 2])
2
nonzero_clusters([0, 1, 2, 0, 1, 2, 0])
2
nonzero_clusters([1, 2, 0, 1, 2, 0, 1, 2])
3
timing
a = np.random.choice((0, 1), 100000)
code
from itertools import groupby

def div(a):
    m = a != 0
    return (m[1:] > m[:-1]).sum() + m[0]

def pir(a):
    return int(np.diff(np.concatenate([[0], a, [0]]) == 0).sum() / 2)

def jean(a):
    previous = 0
    count = 0
    for c in a:
        if previous==0 and c!=0:
            count+=1
        previous = c
    return count

def moin(a):
    return len([is_true for is_true, _ in groupby(a, lambda x: x!=0) if is_true])

def user(a):
    return sum([1 for n in range(len(a) - 1) if not a[n] and a[n + 1]])
sum ([1 for n in range (len (a) - 1) if not a[n] and a[n + 1]])

Calculate percentage of count for a list of arrays

Simple problem, but I cannot seem to get it to work. I want to calculate the percentage a number occurs in a list of arrays and output this percentage accordingly.
I have a list of arrays which looks like this:
import numpy as np
# Create some data
listvalues = []
arr1 = np.array([0, 0, 2])
arr2 = np.array([1, 1, 2, 2])
arr3 = np.array([0, 2, 2])
listvalues.append(arr1)
listvalues.append(arr2)
listvalues.append(arr3)
listvalues
>[array([0, 0, 2]), array([1, 1, 2, 2]), array([0, 2, 2])]
Now I count the occurrences using collections, which returns a list of collections.Counter:
import collections
counter = []
for i in xrange(len(listvalues)):
    counter.append(collections.Counter(listvalues[i]))
counter
>[Counter({0: 2, 2: 1}), Counter({1: 2, 2: 2}), Counter({0: 1, 2: 2})]
The result I am looking for is an array with 3 columns, representing the values 0 to 2, and len(listvalues) rows. Each cell should be filled with the percentage of that value occurring in the corresponding array:
# Result
66.66 0 33.33
0 50 50
33.33 0 66.66
So 0 occurs 66.66% in array 1, 0% in array 2 and 33.33% in array 3, and so on..
What would be the best way to achieve this?
Many thanks!
Here's an approach -
# Get lengths of each element in input list
lens = np.array([len(item) for item in listvalues])
# Form group ID array to ID elements in flattened listvalues
ID_arr = np.repeat(np.arange(len(lens)),lens)
# Extract all values & considering each row as an indexing perform counting
vals = np.concatenate(listvalues)
out_shp = [ID_arr.max()+1,vals.max()+1]
counts = np.bincount(ID_arr*out_shp[1] + vals)
# Finally get the percentages with dividing by group counts
out = 100*np.true_divide(counts.reshape(out_shp),lens[:,None])
Sample run with an additional fourth array in input list -
In [316]: listvalues
Out[316]: [array([0, 0, 2]),array([1, 1, 2, 2]),array([0, 2, 2]),array([4, 0, 1])]
In [317]: print out
[[ 66.66666667 0. 33.33333333 0. 0. ]
[ 0. 50. 50. 0. 0. ]
[ 33.33333333 0. 66.66666667 0. 0. ]
[ 33.33333333 33.33333333 0. 0. 33.33333333]]
The numpy_indexed package has a utility function for this, called count_table, which can be used to solve your problem efficiently as such:
import numpy_indexed as npi
arrs = [arr1, arr2, arr3]
idx = [np.ones(len(a))*i for i, a in enumerate(arrs)]
(rows, cols), table = npi.count_table(np.concatenate(idx), np.concatenate(arrs))
table = table / table.sum(axis=1, keepdims=True)
print(table * 100)
You can get a list of all values and then simply iterate over the individual arrays to get the percentages:
values = set([y for row in listvalues for y in row])
print [[(a==x).sum()*100.0/len(a) for x in values] for a in listvalues]
You can create a list with the percentages with the following code:
percentage_list = [((counter[i].get(j) if counter[i].get(j) else 0)*10000)//len(listvalues[i])/100.0 for i in range(len(listvalues)) for j in range(3)]
After that, create a NumPy array from that list:
results = np.array(percentage_list)
Reshape it so we have the desired result:
results = results.reshape(3,3)
This should allow you to get what you wanted.
This is most likely not efficient, and not the best way to do this, but it has the merit of working.
Do not hesitate if you have any questions.
I would like to use a functional paradigm to solve this problem. For example:
>>> import numpy as np
>>> import pprint
>>>
>>> arr1 = np.array([0, 0, 2])
>>> arr2 = np.array([1, 1, 2, 2])
>>> arr3 = np.array([0, 2, 2])
>>>
>>> arrays = (arr1, arr2, arr3)
>>>
>>> u = np.unique(np.hstack(arrays))
>>>
>>> result = [[1.0 * c.get(uk, 0) / l
... for l, c in ((len(arr), dict(zip(*np.unique(arr, return_counts=True))))
... for arr in arrays)] for uk in u]
>>>
>>> pprint.pprint(result)
[[0.6666666666666666, 0.0, 0.3333333333333333],
[0.0, 0.5, 0.0],
[0.3333333333333333, 0.5, 0.6666666666666666]]
