I have the following list or numpy array
ll=[7.2,0,0,0,0,0,6.5,0,0,-8.1,0,0,0,0]
and an additional list indicating the positions of non-zeros
i=[0,6,9]
I would like to make two new lists out of them, one filling the zeros and one counting in between, for this short example:
a=[7.2,7.2,7.2,7.2,7.2,7.2,6.5,6.5,6.5,-8.1,-8.1,-8.1,-8.1,-8.1]
b=[0,1,2,3,4,5,0,1,2,0,1,2,3,4]
Is there a way to do that without a for loop to speed things up? The list ll is quite long in my case.
Array a is the result of a forward fill, and array b holds, for each position, the offset from the most recent non-zero element.
pandas has a forward-fill function, but it is easy enough to compute with numpy, and there are many sources on how to do this.
import numpy as np
ll=[7.2,0,0,0,0,0,6.5,0,0,-8.1,0,0,0,0]
a = np.array(ll)
# find zero elements and associated index
mask = a == 0
idx = np.where(~mask, np.arange(mask.size), 0)
# do the fill
a[np.maximum.accumulate(idx)]
output:
array([ 7.2, 7.2, 7.2, 7.2, 7.2, 7.2, 6.5, 6.5, 6.5, -8.1, -8.1,
-8.1, -8.1, -8.1])
More information about forward fill is found here:
Most efficient way to forward-fill NaN values in numpy array
Finding the consecutive zeros in a numpy array
To compute array b, you could reuse the forward-fill indices and subtract them from a single np.arange:
fill_mask = np.maximum.accumulate(idx)
np.arange(len(fill_mask)) - fill_mask
output:
array([0, 1, 2, 3, 4, 5, 0, 1, 2, 0, 1, 2, 3, 4])
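Both steps can be wrapped into one helper. The following is a minimal sketch; the name ffill_and_count is made up for illustration, and it assumes a 1-D input whose first element is non-zero (leading zeros would stay zero and simply count up from index 0).

```python
import numpy as np

def ffill_and_count(values):
    """Forward-fill zeros and count the distance to the last non-zero entry."""
    a = np.asarray(values, dtype=float)
    positions = np.arange(a.size)
    # Index of each non-zero entry, 0 elsewhere
    idx = np.where(a != 0, positions, 0)
    # Running maximum = index of the most recent non-zero entry
    last = np.maximum.accumulate(idx)
    return a[last], positions - last

ll = [7.2, 0, 0, 0, 0, 0, 6.5, 0, 0, -8.1, 0, 0, 0, 0]
filled, counter = ffill_and_count(ll)
print(filled)
print(counter)
```

Computing the running maximum once and reusing it for both outputs avoids a second pass over the data.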
So...
import numpy as np
ll = np.array([7.2, 0, 0, 0, 0, 0, 6.5, 0, 0, -8.1, 0, 0, 0, 0])
i = np.array([0, 6, 9])
counts = np.append(
    np.diff(i),  # difference between each element in i
                 # (one element shorter than i)
    len(ll) - i[-1],  # + length of last repeat
)
repeated = np.repeat(ll[i], counts)
repeated becomes
[ 7.2 7.2 7.2 7.2 7.2 7.2 6.5 6.5 6.5 -8.1 -8.1 -8.1 -8.1 -8.1]
b could be computed with
b = np.concatenate([np.arange(c) for c in counts])
print(b)
# [0 1 2 3 4 5 0 1 2 0 1 2 3 4]
but that involves a loop in the form of that list comprehension; perhaps someone Numpyier could implement it without a Python loop.
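One possible loop-free way to get b, sketched here under the assumption that i is sorted and holds the start of every run: repeat each run's start position once per element of that run and subtract it from the absolute positions. The variable names run_starts and positions are only illustrative.

```python
import numpy as np

ll = np.array([7.2, 0, 0, 0, 0, 0, 6.5, 0, 0, -8.1, 0, 0, 0, 0])
i = np.array([0, 6, 9])

# Length of each run, as in the snippet above
counts = np.append(np.diff(i), len(ll) - i[-1])

# Start position of every run, repeated once per element of that run
run_starts = np.repeat(i, counts)

# Absolute positions covered by the runs (from the first start to the end)
positions = np.arange(i[0], len(ll))

b = positions - run_starts
print(b)
```

This replaces the list comprehension with a single np.repeat and a subtraction, so no Python-level loop remains.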
This is an extension of the question posed here (quoted below)
I have a matrix (2d numpy ndarray, to be precise):
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
And I want to roll each row of A independently, according to roll
values in another array:
r = np.array([2, 0, -1])
That is, I want to do this:
print(np.array([np.roll(row, x) for row,x in zip(A, r)]))
[[0 0 4]
[1 2 3]
[0 5 0]]
Is there a way to do this efficiently? Perhaps using fancy indexing
tricks?
The accepted solution was:
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
# Use always a negative shift, so that column_indices are valid.
# (could also use modulo operation)
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:,np.newaxis]
result = A[rows, column_indices]
I would basically like to do the same thing, except when an index gets rolled "past" the end of the row, I would like the other side of the row to be padded with a NaN, rather than the value move to the "front" of the row in a periodic fashion.
Maybe using np.pad somehow? But I can't figure out how to get that to pad different rows by different amounts.
Inspired by Roll rows of a matrix independently's solution, here's a vectorized one based on np.lib.stride_tricks.as_strided -
import numpy as np
from skimage.util.shape import view_as_windows as viewW
def strided_indexing_roll(a, r):
    # Pad both sides with NaN columns to cover all rolls
    p = np.full((a.shape[0],a.shape[1]-1),np.nan)
    a_ext = np.concatenate((p,a,p),axis=1)
    # Get sliding windows; use advanced-indexing to select appropriate ones
    n = a.shape[1]
    return viewW(a_ext,(1,n))[np.arange(len(r)), -r + (n-1),0]
Sample run -
In [76]: a
Out[76]:
array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
In [77]: r
Out[77]: array([ 2, 0, -1])
In [78]: strided_indexing_roll(a, r)
Out[78]:
array([[nan, nan, 4.],
[ 1., 2., 3.],
[ 0., 5., nan]])
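If scikit-image is not available, the same NaN-padded roll can probably be obtained with plain index arithmetic. This is only a sketch, not the approach from the answer above; the function name roll_rows_nan is hypothetical, and it assumes a 2-D input and integer shifts.

```python
import numpy as np

def roll_rows_nan(a, r):
    """Shift each row of `a` by r[i] columns, filling vacated cells with NaN."""
    a = np.asarray(a, dtype=float)
    r = np.asarray(r)
    n_rows, n_cols = a.shape
    # Source column for every output cell: out[i, j] = a[i, j - r[i]]
    src = np.arange(n_cols) - r[:, None]
    valid = (src >= 0) & (src < n_cols)
    rows = np.arange(n_rows)[:, None]
    # Clip only to keep the gather in bounds; invalid cells are replaced below
    gathered = a[rows, np.clip(src, 0, n_cols - 1)]
    return np.where(valid, gathered, np.nan)

A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
r = np.array([2, 0, -1])
rolled = roll_rows_nan(A, r)
print(rolled)
```

Because the validity mask is computed from the source indices, shifts larger than the row length simply produce an all-NaN row.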
I was able to hack this together with linear indexing...it gets the right result but performs rather slowly on large arrays.
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]]).astype(float)
r = np.array([2, 0, -1])
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
# Use always a negative shift, so that column_indices are valid.
# (could also use modulo operation)
r_old = r.copy()
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:,np.newaxis]
result = A[rows, column_indices]
# replace with NaNs
row_length = result.shape[-1]
pad_inds = []
for ind,i in enumerate(r_old):
    if i > 0:
        inds2pad = [np.ravel_multi_index((ind,) + (j,),result.shape) for j in range(i)]
        pad_inds.extend(inds2pad)
    if i < 0:
        inds2pad = [np.ravel_multi_index((ind,) + (j,),result.shape) for j in range(row_length+i,row_length)]
        pad_inds.extend(inds2pad)
result.ravel()[pad_inds] = np.nan
Gives the expected result:
print(result)
[[ nan nan 4.]
[ 1. 2. 3.]
[ 0. 5. nan]]
Based on @Seberg's and @yann-dubois's answers for the non-NaN case, I've written a method that:
Is faster than the current answer
Works on ndarrays of any shape (specify the row-axis using the axis argument)
Allows for setting fill to either np.nan, any other "fill value" or False to allow regular rolling across the array edge.
Benchmarking
cols, rows = 1024, 2048
arr = np.stack(rows*(np.arange(cols,dtype=float),))
shifts = np.random.randint(-cols, cols, rows)
np.testing.assert_array_almost_equal(row_roll(arr, shifts), strided_indexing_roll(arr, shifts))
# True
%timeit row_roll(arr, shifts)
# 25.9 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit strided_indexing_roll(arr, shifts)
# 29.7 ms ± 446 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
def row_roll(arr, shifts, axis=1, fill=np.nan):
    """Apply an independent roll to each row along a single axis.
    Parameters
    ----------
    arr : np.ndarray
        Array of any shape.
    shifts : np.ndarray, dtype int. Shape: `(arr.shape[:axis],)`.
        Amount to roll each row by. Positive shifts row right.
    axis : int
        Axis along which elements are shifted.
    fill: bool or float
        If not False, value used to fill the vacated positions. If False, just rolls across edges.
    """
    if np.issubdtype(arr.dtype, np.integer) and isinstance(fill, float):
        arr = arr.astype(float)
    shifts2 = shifts.copy()
    arr = np.swapaxes(arr,axis,-1)
    # np.ogrid may return a tuple, so convert to a list before item assignment
    all_idcs = list(np.ogrid[tuple(slice(0,n) for n in arr.shape)])
    # Convert to a positive shift
    shifts2[shifts2 < 0] += arr.shape[-1]
    all_idcs[-1] = all_idcs[-1] - shifts2[:, np.newaxis]
    result = arr[tuple(all_idcs)]
    if fill is not False:
        # Create mask of row positions above negative shifts
        # or below positive shifts. Then set them to the fill value.
        *_, nrows, ncols = arr.shape
        mask_neg = shifts < 0
        mask_pos = shifts >= 0
        shifts_pos = shifts.copy()
        shifts_pos[mask_neg] = 0
        shifts_neg = shifts.copy()
        shifts_neg[mask_pos] = ncols+1 # need to be bigger than the biggest positive shift
        shifts_neg[mask_neg] = shifts[mask_neg] % ncols
        indices = np.stack(nrows*(np.arange(ncols),))
        nanmask = (indices < shifts_pos[:, None]) | (indices >= shifts_neg[:, None])
        result[nanmask] = fill
    arr = np.swapaxes(result,-1,axis)
    return arr
...and that reference comes from a separate matrix.
This question is an extension of an earlier answered question where the reference element came directly from the same column it was being compared against. Some clever sorting and referencing the index of the sort seemed to solve that one.
Broadcasting has been suggested in both the original and this new question. I run out of memory at around n ~ 3000 and need another order of magnitude larger yet.
The Target ( Production-grade ) Scaling Definitions:
To keep the proposed approaches fair and mutually comparable, in both the [SPACE]- and the [TIME]-domains,
let's assume n = 50000; m = 20; k = 50; a = np.random.rand( n, m ); ...
I'm now interested in a more general form where the reference value comes from another matrix of reference values.
Original question:
Vectorized pythonic way to get count of elements greater than current element
New question: Can we write a vectorized form to perform the following role.
Function receives as input 2 2-d arrays.
A = n x m
B = k x m
and returns
C = k x m.
C[i,j] is the proportion of observations in A[:,j] ( just the j-th column ) that are larger than B[i,j]
Here is my embarrassingly slow double for loop implementation.
import numpy as np
n = 100
m = 20
k = 50
a = np.random.rand(n,m)
b = np.random.rand(k,m)
c = np.zeros((k,m))
for j in range(0,m): #cols
    for i in range(0,k): # rows
        r = b[i,j]
        c[i,j] = ( ( a[:,j] > r).sum() ) / (n)
Approach #1
We could again use the argsort trick discussed in this solution, but with a slight twist. We would concatenate the second array onto the first and then argsort the result. We need to argsort both the concatenated array and the second array on its own to get the desired output. The implementation would look something like this -
ab = np.vstack((a,b))
len_a, len_b = len(a), len(b)
b_argoffset = b.argsort(0).argsort(0)
total_args = ab.argsort(0).argsort(0)[-len_b:]
out = len_a - total_args + b_argoffset
Explanation
Concatenate second array whose values are to be computed into the first array.
Now, since we are appending, we would have their index positions later on, after the first array length has ended.
We use one argsort to get the relative positions of the second array w.r.t. the entire concatenated array and one more argsort to trace those indices back to the original order.
We need to repeat the double argsort-ing for the second array on itself, so as to compensate for the concatenation.
These indices are for each element in b with the comparison : a[:,j] > b[i,j]. Now, these index orders are 0-based, i.e. an index closer to 0 means a greater number of elements in a[:,j] are larger than the current element b[i,j], so a greater count, and vice versa. So, we need to subtract those indices from the length of a[:,j] for the final output.
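To make the reasoning concrete: the rank of b[i,j] inside the stacked column equals the number of a-values below it plus the number of b-values below it, so subtracting b's own rank leaves only the a-part. The sketch below checks this against a plain double loop on small random data. The function names are illustrative only, and it assumes no ties between values (true with probability 1 for random floats).

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((100, 4))
b = rng.random((7, 4))

def count_greater_argsort(a, b):
    ab = np.vstack((a, b))
    # Rank of each b element among the b values of its column
    b_rank = b.argsort(0).argsort(0)
    # Rank of each b element among all a and b values of its column
    total_rank = ab.argsort(0).argsort(0)[-len(b):]
    # total_rank - b_rank = number of a values below b[i, j]
    return len(a) - total_rank + b_rank

def count_greater_loop(a, b):
    out = np.empty(b.shape, dtype=int)
    for j in range(b.shape[1]):
        for i in range(b.shape[0]):
            out[i, j] = (a[:, j] > b[i, j]).sum()
    return out

counts = count_greater_argsort(a, b)
print(np.array_equal(counts, count_greater_loop(a, b)))
```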
Approach #1 - Improvement
We would optimize it further by using array-assignment, again inspired by Approach #2 from the same solution. So, those arg outputs : b_argoffset and total_args could be alternatively computed, like so -
def unqargsort(a):
    n,m = a.shape
    idx = a.argsort(0)
    out = np.zeros((n,m),dtype=int)
    out[idx, np.arange(m)] = np.arange(n)[:,None]
    return out
b_argoffset = unqargsort(b)
total_args = unqargsort(ab)[-len_b:]
Approach #2
We could also leverage searchsorted for an altogether different approach -
k,m = b.shape
sidx = a.argsort(0)
out = np.empty((k,m), dtype=int)
for i in range(m): #cols
    out[:,i] = np.searchsorted(a[:,i], b[:,i],sorter=sidx[:,i])
out = len(a) - out
Explanation
We get the sorted order indices for each column of a.
Then, we use those indices to find where the values of b would be placed in the sorted a with searchsorted. This gives us the same as the output from steps #3 and #4 in Approach #1.
Note that these approaches give us the count. So, for the final output, divide the output thus obtained by n.
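One detail worth keeping in mind when ties are possible: with the default side="left", searchsorted returns the number of elements strictly smaller than the probe, so len(a) minus that counts a >= b rather than a > b. The sketch below is a variant, not the code from the answer: it sorts a once and uses side="right" to count strictly greater elements. The function name is hypothetical, and the integer data is chosen to contain ties.

```python
import numpy as np

def count_greater_searchsorted(a, b):
    """Count, per column, how many entries of a are strictly greater than b[i, j]."""
    a_sorted = np.sort(a, axis=0)
    n = a.shape[0]
    out = np.empty(b.shape, dtype=np.int64)
    for j in range(a.shape[1]):
        # side="right": index just past all entries <= b[i, j]
        out[:, j] = n - np.searchsorted(a_sorted[:, j], b[:, j], side="right")
    return out

a = np.array([[2, 2, 8], [5, 0, 8], [2, 5, 7], [4, 4, 4], [2, 6, 7]])
b = np.array([[3, 6, 8], [2, 7, 5], [8, 9, 2], [9, 8, 7], [2, 7, 2]])
proportions = count_greater_searchsorted(a, b) / len(a)
print(proportions)
```

The loop runs over the m columns only, so for large n and small m it stays cheap in both time and memory.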
I think you can use broadcasting:
c = (a[:,None,:] > b).mean(axis=0)
Demo:
In [207]: n = 5
In [208]: m = 3
In [209]: a = np.random.randint(10, size=(n,m))
In [210]: b = np.random.randint(10, size=(n,m))
In [211]: c = np.zeros((n,m))
In [212]: a
Out[212]:
array([[2, 2, 8],
[5, 0, 8],
[2, 5, 7],
[4, 4, 4],
[2, 6, 7]])
In [213]: b
Out[213]:
array([[3, 6, 8],
[2, 7, 5],
[8, 9, 2],
[9, 8, 7],
[2, 7, 2]])
In [214]: for j in range(0,m): #cols
     ...:     for i in range(0,n): # rows
     ...:         r = b[i,j]
     ...:         c[i,j] = ( ( a[:,j] > r).sum() ) / (n)
     ...:
In [215]: c
Out[215]:
array([[0.4, 0. , 0. ],
[0.4, 0. , 0.8],
[0. , 0. , 1. ],
[0. , 0. , 0.4],
[0.4, 0. , 1. ]])
In [216]: (a[:,None,:] > b).mean(axis=0)
Out[216]:
array([[0.4, 0. , 0. ],
[0.4, 0. , 0.8],
[0. , 0. , 1. ],
[0. , 0. , 0.4],
[0.4, 0. , 1. ]])
check:
In [217]: ((a[:,None,:] > b).mean(axis=0) == c).all()
Out[217]: True
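Since the full broadcast builds an n x k x m boolean array, memory can become the limit for large n. A possible workaround, sketched below, processes a in row blocks and accumulates counts. The function name and the chunk size of 1024 are arbitrary assumptions, not tuned values.

```python
import numpy as np

def proportion_greater_chunked(a, b, chunk=1024):
    """Proportion of a[:, j] greater than b[i, j], using bounded temporary memory."""
    n = a.shape[0]
    counts = np.zeros(b.shape, dtype=np.int64)
    for start in range(0, n, chunk):
        block = a[start:start + chunk]
        # Temporary array has shape (len(block), k, m) instead of (n, k, m)
        counts += (block[:, None, :] > b).sum(axis=0)
    return counts / n

rng = np.random.default_rng(1)
a = rng.random((5000, 20))
b = rng.random((50, 20))
c = proportion_greater_chunked(a, b, chunk=512)
print(c.shape)
```

The result matches the one-shot broadcast, while the peak temporary size is controlled by chunk rather than by n.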
I have a 2d array a and a 1d array b. I want to compute the sum of the rows in array a, grouped by each id in b. For example:
import numpy as np
a = np.array([[1,2,3],[2,3,4],[4,5,6]])
b = np.array([0,1,0])
count = len(b)
ls = list(set(b))
res = np.zeros((len(ls),a.shape[1]))
for i in ls:
    res[i] = np.array([a[x] for x in range(0,count) if b[x] == i]).sum(axis=0)
print(res)
I got the printed result as:
[[ 5. 7. 9.]
[ 2. 3. 4.]]
What I want to do is, since the 1st and 3rd elements of b are 0, I perform a[0]+a[2], which is [5, 7, 9] as one row of the results. Similarly, the 2nd element of b is 1, so that I perform a[1], which is [2, 3, 4] as another row of the results.
But it seems my implementation is quite slow for large arrays. Is there a better implementation?
I know there is a bincount function in numpy, but it seems to support only 1d arrays.
Thank you all for helping me!
The numpy_indexed package (disclaimer: I am its author) was made to address problems exactly of this kind in an efficiently vectorized and general manner:
import numpy_indexed as npi
unique_b, mean_a = npi.group_by(b).mean(a)
Note that this solution is general in the sense that it provides a rich set of standard reduction functions (sum, min, mean, median, argmin, and so on), axis keywords if you need to work with different axes, and also the ability to group by more complicated things than just positive integer arrays, such as the elements of multidimensional arrays of arbitrary dtype.
import numpy_indexed as npi
# this caches the complicated O(NlogN) part of the operations
groups = npi.group_by(b)
# all these subsequent operations have the same low vectorized O(N) cost
unique_b, mean_a = groups.mean(a)
unique_b, sum_a = groups.sum(a)
unique_b, min_a = groups.min(a)
Approach #1
You can use np.add.at, which works for ndarrays of generic dimensions, unlike np.bincount that expects only 1D arrays -
np.add.at(res, b, a)
Sample run -
In [40]: a
Out[40]:
array([[1, 2, 3],
[2, 3, 4],
[4, 5, 6]])
In [41]: b
Out[41]: array([0, 1, 0])
In [45]: res = np.zeros((b.max()+1, a.shape[1]), dtype=a.dtype)
In [46]: np.add.at(res, b, a)
In [47]: res
Out[47]:
array([[5, 7, 9],
[2, 3, 4]])
To compute mean values, we need to use np.bincount to get the counts per label/tag and then divide with those along each row, like so -
In [49]: res/np.bincount(b)[:,None].astype(float)
Out[49]:
array([[ 2.5, 3.5, 4.5],
[ 2. , 3. , 4. ]])
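Although np.bincount only accepts 1D input, it can still be applied one column at a time through its weights argument. The sketch below is an alternative to np.add.at, not a replacement recommended by the answer; it assumes b holds non-negative integer labels and that every label from 0 to b.max() occurs at least once (otherwise the mean would divide by zero).

```python
import numpy as np

a = np.array([[1, 2, 3],
              [2, 3, 4],
              [4, 5, 6]])
b = np.array([0, 1, 0])

n_groups = b.max() + 1

# One weighted bincount per column; each call sums a[:, j] within every label
sums = np.stack(
    [np.bincount(b, weights=a[:, j], minlength=n_groups) for j in range(a.shape[1])],
    axis=1,
)

# Divide by the group sizes to get per-group means
means = sums / np.bincount(b, minlength=n_groups)[:, None]
print(sums)
print(means)
```

Note that bincount with weights returns floating-point sums, so the result is float even for integer input.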
Generalizing to handle b that are not necessarily in sequence from 0, we could make it generic and put in a nice little function to handle summations and averages in a cleaner way, like so -
def groupby_addat(a, b, out="sum"):
    unqb, tags, counts = np.unique(b, return_inverse=1, return_counts=1)
    res = np.zeros((tags.max()+1, a.shape[1]), dtype=a.dtype)
    np.add.at(res, tags, a)
    if out=="mean":
        return unqb, res/counts[:,None].astype(float)
    elif out=="sum":
        return unqb, res
    else:
        print("Invalid output")
        return None
Sample run -
In [201]: a
Out[201]:
array([[1, 2, 3],
[2, 3, 4],
[4, 5, 6]])
In [202]: b
Out[202]: array([ 5, 10, 5])
In [204]: b_ids, means = groupby_addat(a, b, out="mean")
In [205]: b_ids
Out[205]: array([ 5, 10])
In [206]: means
Out[206]:
array([[ 2.5, 3.5, 4.5],
[ 2. , 3. , 4. ]])
Approach #2
We could also make use of np.add.reduceat, which might be more performant -
def groupby_addreduceat(a, b, out="sum"):
    sidx = b.argsort()
    sb = b[sidx]
    spt_idx =np.concatenate(([0], np.flatnonzero(sb[1:] != sb[:-1])+1, [sb.size]))
    sums = np.add.reduceat(a[sidx],spt_idx[:-1])
    if out=="mean":
        counts = spt_idx[1:] - spt_idx[:-1]
        return sb[spt_idx[:-1]], sums/counts[:,None].astype(float)
    elif out=="sum":
        return sb[spt_idx[:-1]], sums
    else:
        print("Invalid output")
        return None
Sample run -
In [201]: a
Out[201]:
array([[1, 2, 3],
[2, 3, 4],
[4, 5, 6]])
In [202]: b
Out[202]: array([ 5, 10, 5])
In [207]: b_ids, means = groupby_addreduceat(a, b, out="mean")
In [208]: b_ids
Out[208]: array([ 5, 10])
In [209]: means
Out[209]:
array([[ 2.5, 3.5, 4.5],
[ 2. , 3. , 4. ]])
Looking for a fast vectorized function that returns the rolling number of consecutive non-zero values. The count should start over at 0 whenever encountering a zero. The result should have the same shape as the input array.
Given an array like this:
x = np.array([2.3, 1.2, 4.1 , 0.0, 0.0, 5.3, 0, 1.2, 3.1])
The function should return this:
array([1, 2, 3, 0, 0, 1, 0, 1, 2])
This post lists a vectorized approach which basically consists of two steps:
Initialize a zeros vector of the same size as input vector, x and set ones at places corresponding to non-zeros of x.
Next up, in that vector, we need to put minus of runlengths of each island right after the ending/stop positions for each "island". The intention is to use cumsum again later on, which would result in sequential numbers for the "islands" and zeros elsewhere.
Here's the implementation -
import numpy as np
#Append zeros at the start and end of input array, x
xa = np.hstack([[0],x,[0]])
# Get an array of ones and zeros, with ones for nonzeros of x and zeros elsewhere
xa1 =(xa!=0)+0
# Find consecutive differences on xa1
xadf = np.diff(xa1)
# Find start and stop+1 indices and thus the lengths of "islands" of non-zeros
starts = np.where(xadf==1)[0]
stops_p1 = np.where(xadf==-1)[0]
lens = stops_p1 - starts
# Mark indices where "minus ones" are to be put for applying cumsum
put_m1 = stops_p1[stops_p1 < x.size]
# Setup vector with ones for nonzero x's, "minus lens" at stops +1 & zeros elsewhere
vec = xa1[1:-1] # Note: this will change xa1, but it's okay as not needed anymore
vec[put_m1] = -lens[0:put_m1.size]
# Perform cumsum to get the desired output
out = vec.cumsum()
Sample run -
In [116]: x
Out[116]: array([ 0. , 2.3, 1.2, 4.1, 0. , 0. , 5.3, 0. , 1.2, 3.1, 0. ])
In [117]: out
Out[117]: array([0, 1, 2, 3, 0, 0, 1, 0, 1, 2, 0], dtype=int32)
Runtime tests -
Here are some runtime tests comparing the proposed approach against the other, itertools.groupby-based approach -
In [21]: N = 1000000
...: x = np.random.rand(1,N)
...: x[x>0.5] = 0.0
...: x = x.ravel()
...:
In [19]: %timeit sumrunlen_vectorized(x)
10 loops, best of 3: 19.9 ms per loop
In [20]: %timeit sumrunlen_loopy(x)
1 loops, best of 3: 2.86 s per loop
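A shorter vectorized variant is also possible by tracking the position of the most recent zero with a running maximum. This is a sketch of a different technique than the cumsum approach above; the function name is illustrative and it assumes a 1-D input.

```python
import numpy as np

def consecutive_nonzero_count(x):
    """Running count of consecutive non-zero values, reset to 0 at every zero."""
    x = np.asarray(x)
    # 1-based positions, so that "no zero seen yet" can be represented by 0
    pos = np.arange(1, x.size + 1)
    # Position of the most recent zero at or before each element (0 if none)
    last_zero = np.maximum.accumulate(np.where(x == 0, pos, 0))
    return pos - last_zero

x = np.array([2.3, 1.2, 4.1, 0.0, 0.0, 5.3, 0, 1.2, 3.1])
out = consecutive_nonzero_count(x)
print(out)
```

At a zero, the current position equals the most recent zero position, so the count is 0; otherwise the difference is the length of the current non-zero run so far.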
You can use itertools.groupby and np.hstack :
>>> import numpy as np
>>> x = np.array([2.3, 1.2, 4.1 , 0.0, 0.0, 5.3, 0, 1.2, 3.1])
>>> from itertools import groupby
>>> np.hstack([[i if j!=0 else j for i,j in enumerate(g,1)] for _,g in groupby(x,key=lambda x: x!=0)])
array([ 1., 2., 3., 0., 0., 1., 0., 1., 2.])
We can group the array elements by whether they are non-zero, then use a list comprehension with enumerate to replace each non-zero sub-array with its running index, and finally flatten the list with np.hstack.
This sub-problem came up in Kick Start 2021 Round A for me. My solution:
def current_run_len(a):
    a_ = np.hstack([0, a != 0, 0]) # first in starts and last in stops defined
    d = np.diff(a_)
    starts = np.where(d == 1)[0]
    stops = np.where(d == -1)[0]
    a_[stops + 1] = -(stops - starts) # +1 for behind-last
    return a_[1:-1].cumsum()
In fact, the problem also required a version where you count down consecutive sequences. Thus here another version with an optional keyword argument which does the same for rev=False:
def current_run_len(a, rev=False):
    a_ = np.hstack([0, a != 0, 0]) # first in starts and last in stops defined
    d = np.diff(a_)
    starts = np.where(d == 1)[0]
    stops = np.where(d == -1)[0]
    if rev:
        a_[starts] = -(stops - starts)
        cs = -a_.cumsum()[:-2]
    else:
        a_[stops + 1] = -(stops - starts) # +1 for behind-last
        cs = a_.cumsum()[1:-1]
    return cs
Results:
a = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1])
print('a = ', a)
print('current_run_len(a) = ', current_run_len(a))
print('current_run_len(a, rev=True) = ', current_run_len(a, rev=True))
a = [1 1 1 1 0 0 0 1 1 0 1 0 0 0 1]
current_run_len(a) = [1 2 3 4 0 0 0 1 2 0 1 0 0 0 1]
current_run_len(a, rev=True) = [4 3 2 1 0 0 0 2 1 0 1 0 0 0 1]
For an array that consists of 0s and 1s only, you can simplify [0, a != 0, 0] to [0, a, 0]. But the version as-posted also works for arbitrary non-zero numbers.