I need to find the index of the per-row minimum in a 2-D array, subject to an additional constraint on the column values. Given two arrays a and b
a = np.array([[1,0,1],[0,0,1],[0,0,0],[1,1,1]])
b = np.array([[1,-1,2],[4,-1,1],[1,-1,2],[1,2,-1]])
the objective is to find the indices where a == 1, b is positive, and b is the minimum value of its row. Fulfilling the first two conditions is easy
idx = np.where(np.logical_and(a == 1, b > 0))
which yields the indices:
(array([0, 0, 1, 3, 3]), array([0, 2, 2, 0, 1]))
Now I need to filter out the duplicate row entries (keeping only the minimum value per row), but I cannot think of an elegant way to achieve that. In the above example the result should be
(array([0,1,3]), array([0,2,0]))
edit:
It should also work for a containing other values than just 0 and 1.
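Updated after understanding the problem better. Try the following (note that it compares against the global minimum of the nonzero entries of c, which in this example happens to equal the minimum of every qualifying row):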
c = b*(b*a > 0)
np.where(c==np.min(c[np.nonzero(c)]))
Output:
(array([0, 1, 3], dtype=int64), array([0, 2, 0], dtype=int64))
Timings:
Method 1
a = np.array([[1,0,1],[0,0,1],[0,0,0],[1,1,1]])
b = np.array([[1,-1,2],[4,-1,1],[1,-1,2],[1,2,-1]])
b[b<0] = 100000
cond = [[True if i == b.argmin(axis=1)[k] else False for i in range(b.shape[1])] for k in range(b.shape[0])]
idx = np.where(np.logical_and(np.logical_and(a == 1, b > 0),cond))
idx
Method 2
c = b*(b*a > 0)
idx1 = np.where(c==np.min(c[np.nonzero(c)]))
idx1
Method 1 Timing:
28.3 µs ± 418 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Method 2 Timing:
12.2 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I found a solution based on list comprehension. It is necessary to change the negative values of b to some high value though.
a = np.array([[1,0,1],[0,0,1],[0,0,0],[1,1,1]])
b = np.array([[1,-1,2],[4,-1,1],[1,-1,2],[1,2,-1]])
b[b<0] = 100000
cond = [[True if i == b.argmin(axis=1)[k] else False for i in range(b.shape[1])] for k in range(b.shape[0])]
idx = np.where(np.logical_and(np.logical_and(a == 1, b > 0),cond))
print(idx)
(array([0, 1, 3]), array([0, 2, 0]))
Please let me hear what you think.
edit: I just noticed that this solution is horribly slow.
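Since the list-comprehension version is slow, a fully vectorized variant may be worth trying. The following is a minimal sketch under the assumptions of the question: entries that fail the two conditions are replaced by np.inf, the per-row argmin is taken, and rows without any valid entry are dropped. The names valid, masked, rows and cols are made up for illustration.

```python
import numpy as np

a = np.array([[1, 0, 1], [0, 0, 1], [0, 0, 0], [1, 1, 1]])
b = np.array([[1, -1, 2], [4, -1, 1], [1, -1, 2], [1, 2, -1]])

# Entries that satisfy both conditions.
valid = (a == 1) & (b > 0)

# Replace invalid entries by +inf so they can never be the row minimum.
masked = np.where(valid, b, np.inf)

# Column of the smallest valid value in each row.
cols = masked.argmin(axis=1)

# Keep only rows that contain at least one valid entry.
rows = np.flatnonzero(valid.any(axis=1))
idx = (rows, cols[rows])

print(idx)  # (array([0, 1, 3]), array([0, 2, 0]))
```

Unlike the b[b<0] = 100000 trick, this leaves b untouched and does not depend on picking a "large enough" sentinel. If a row has several equal minima, argmin returns only the first one.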
How can I compute the variance while ignoring zero elements?
For example
np.var([[1, 1], [1, 2]], axis=1) -> [0, 0.25]
I need:
var([[1, 1, 0], [1, 2, 0]], axis=1) -> [0, 0.25]
Is this what you are looking for? You can drop the columns in which all values are 0 (that is, keep the columns where at least one value is not 0).
m = np.array([[1, 1, 0], [1, 2, 0]])
np.var(m[:, np.any(m != 0, axis=0)], axis=1)
# Output
array([0. , 0.25])
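The column filter above only helps when the zeros fill whole columns. If the zeros can sit at different positions in each row, one possible alternative is to turn them into NaN and use np.nanvar. This is a sketch, and the helper name var_ignoring_zeros and the ragged example array are assumptions made for illustration.

```python
import numpy as np

def var_ignoring_zeros(arr):
    """Per-row variance, treating zeros as missing values."""
    as_float = arr.astype(float)       # NaN needs a float array
    as_float[as_float == 0] = np.nan   # mark zeros as missing
    return np.nanvar(as_float, axis=1)

m = np.array([[1, 1, 0], [1, 2, 0]])
ragged = np.array([[1, 0, 3], [2, 2, 0]])  # zeros in different columns

print(var_ignoring_zeros(m))       # [0.   0.25]
print(var_ignoring_zeros(ragged))  # [1. 0.]
```

A row consisting only of zeros yields NaN (with a RuntimeWarning), which may or may not be what you want.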
V1
You can use a masked array:
data = np.array([[1, 1, 0], [1, 2, 0]])
np.ma.array(data, mask=(data == 0)).var(axis=1)
The result is
masked_array(data=[0. , 0.25],
mask=False,
fill_value=1e+20)
The raw numpy array is the data attribute of the resulting masked array:
>>> np.ma.array(data, mask=(data == 0)).var(axis=1).data
array([0. , 0.25])
V2
Without masked arrays, the operation of removing a variable number of elements in each row is a bit tricky. It would be simpler to implement the variance in terms of the formula sum(x**2) / N - (sum(x) / N)**2 and partial reduction of ufuncs.
First we need to find the split indices and segment lengths. In the general case, that looks like
lens = np.count_nonzero(data, axis=1)
inds = np.r_[0, lens[:-1].cumsum()]
Now you can operate on the raveled masked data:
mdata = data[data != 0]
mdata2 = mdata**2
var = np.add.reduceat(mdata2, inds) / lens - (np.add.reduceat(mdata, inds) / lens)**2
This gives you the same result for var (probably more efficiently than the masked version by the way):
array([0. , 0.25])
V3
The var function appears to use the more traditional formula ((x - x.mean())**2).mean(). You can implement that using the quantities above with just a bit more work:
means = (np.add.reduceat(mdata, inds) / lens).repeat(lens)
var = np.add.reduceat((mdata - means)**2, inds) / lens
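To make the split indices concrete, here is a small worked check on an array whose rows have different numbers of nonzero elements. It is only a sanity check under the assumption that every row has at least one nonzero element; the sample data and the names var_v2, var_v3 and reference are chosen for illustration.

```python
import numpy as np

data = np.array([[1, 0, 3, 5], [2, 2, 0, 0], [4, 0, 0, 7]])

lens = np.count_nonzero(data, axis=1)   # nonzero count per row: [3, 2, 2]
inds = np.r_[0, lens[:-1].cumsum()]     # segment starts in mdata: [0, 3, 5]
mdata = data[data != 0]                 # flattened nonzeros: [1, 3, 5, 2, 2, 4, 7]

# Segment sums, reused by both formulations.
sums = np.add.reduceat(mdata, inds)

# E[x^2] - E[x]^2 formulation.
var_v2 = np.add.reduceat(mdata**2, inds) / lens - (sums / lens)**2

# Mean of squared deviations formulation.
means = (sums / lens).repeat(lens)
var_v3 = np.add.reduceat((mdata - means)**2, inds) / lens

# Masked-array result for comparison.
reference = np.ma.array(data, mask=(data == 0)).var(axis=1).data

print(np.allclose(var_v2, reference), np.allclose(var_v3, reference))  # True True
```

A row with no nonzero elements would give lens == 0 for that row and break the division, so such rows need separate handling.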
Comparison
Here is a quick benchmark for the three approaches:
def nzvar_v1(data):
    return np.ma.array(data, mask=(data == 0)).var(axis=1).data
def nzvar_v2(data):
    lens = np.count_nonzero(data, axis=1)
    inds = np.r_[0, lens[:-1].cumsum()]
    mdata = data[data != 0]
    return np.add.reduceat(mdata**2, inds) / lens - (np.add.reduceat(mdata, inds) / lens)**2
def nzvar_v3(data):
    lens = np.count_nonzero(data, axis=1)
    inds = np.r_[0, lens[:-1].cumsum()]
    mdata = data[data != 0]
    return np.add.reduceat((mdata - (np.add.reduceat(mdata, inds) / lens).repeat(lens))**2, inds) / lens
np.random.seed(100)
data = np.random.randint(10, size=(1000, 1000))
%timeit nzvar_v1(data)
18.3 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nzvar_v2(data)
5.89 ms ± 69.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nzvar_v3(data)
11.8 ms ± 62.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So for a large dataset, the second approach, while requiring a bit more code, appears to be ~3x faster than masked arrays and ~2x faster than using the traditional formulation.
I need to check if an array A contains all elements of another array B. If not, output the missing elements. Both A and B contain integers, and B always runs from 0 to N in steps of 1.
import numpy as np
A=np.array([1,2,3,6,7,8,9])
B=np.arange(10)
I know that I can use the following to check whether any elements are missing, but it does not give the indices of the missing elements.
all(elem in A for elem in B)
Is there a good way in python to output the indices of the missing elements?
IIUC, you can try the following, assuming that B is always an "index" list:
[i for i in B if i not in A]
The output would be : [0, 4, 5]
Best way to do it with numpy
NumPy actually has a function for this: numpy.setdiff1d
np.setdiff1d(B, A)
# Which returns
array([0, 4, 5])
You can use enumerate to get both index and content of a list. The following code would do what you want
idx = [idx for idx, element in enumerate(B) if element not in A]
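The list comprehensions above test membership one element at a time, which can get slow for large arrays. A vectorized sketch of the same idea uses np.isin to build a boolean mask; the variable names missing_mask, missing_idx and missing_vals are chosen for illustration.

```python
import numpy as np

A = np.array([1, 2, 3, 6, 7, 8, 9])
B = np.arange(10)

# True where an element of B does not occur anywhere in A.
missing_mask = ~np.isin(B, A)

missing_idx = np.flatnonzero(missing_mask)  # positions in B
missing_vals = B[missing_mask]              # the missing values themselves

print(missing_idx)   # [0 4 5]
print(missing_vals)  # [0 4 5]
```

Indices and values coincide here only because B is np.arange(10); for a general B the two arrays differ.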
I am assuming we want to get the elements exclusive to B, when compared to A.
Approach #1
Given that B always runs from 0 to N in steps of 1, we can use a simple mask-based approach -
mask = np.ones(len(B), dtype=bool)
mask[A] = False
out = B[mask]
Approach #2
Another one that modifies B in place and is therefore more memory-efficient -
B[A] = -1
out = B[B>=0]
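For reference, here is how the first two approaches might look on the sample data from the question. This is a sketch that assumes every value in A is a valid index into B; the copy B_work is an addition so that the original B survives the in-place variant.

```python
import numpy as np

A = np.array([1, 2, 3, 6, 7, 8, 9])
B = np.arange(10)

# Approach #1: start with "everything is missing", then clear what A contains.
mask = np.ones(len(B), dtype=bool)
mask[A] = False
out1 = B[mask]

# Approach #2: overwrite the found elements with a sentinel.
# Working on a copy (an assumption, not part of the original) keeps B intact.
B_work = B.copy()
B_work[A] = -1
out2 = B_work[B_work >= 0]

print(out1)  # [0 4 5]
print(out2)  # [0 4 5]
```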
Approach #3
A more generic case of integers could be handled differently -
def setdiff_for_ints(B, A):
    N = max(B.max(), A.max()) - min(min(A.min(),B.min()),0) + 1
    mask = np.zeros(N, dtype=bool)
    mask[B] = True
    mask[A] = False
    out = np.flatnonzero(mask)
    return out
Sample run -
In [77]: A
Out[77]: array([ 1, 2, 3, 6, 7, 8, -6])
In [78]: B
Out[78]: array([1, 3, 4, 5, 7, 9])
In [79]: setdiff_for_ints(B, A)
Out[79]: array([4, 5, 9])
# Using np.setdiff1d to verify :
In [80]: np.setdiff1d(B, A)
Out[80]: array([4, 5, 9])
Timings -
In [81]: np.random.seed(0)
...: A = np.unique(np.random.randint(-10000,100000,1000000))
...: B = np.unique(np.random.randint(0,100000,1000000))
# @Hugolmn's soln with np.setdiff1d
In [82]: %timeit np.setdiff1d(B, A)
4.78 ms ± 96.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [83]: %timeit setdiff_for_ints(B, A)
599 µs ± 6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Given a = [1, 2, 3, 4, 5]
After encoding, a' = [1, 1, 1, 1, 1], where each element represents the difference compared to the previous element.
I know this can be done with
for i in range(len(a) - 1, 0, -1):
    a[i] = a[i] - a[i - 1]
Is there a faster way? I am working with 2 billion numbers here, and the process takes about 30 minutes.
One way using itertools.starmap, islice and operator.sub:
from operator import sub
from itertools import starmap, islice
l = list(range(1, 10000000))
[l[0], *starmap(sub, zip(islice(l, 1, None), l))]
Output:
[1, 1, 1, ..., 1]
Benchmark:
l = list(range(1, 100000000))
# OP's method
%timeit [l[i] - l[i - 1] for i in range(len(l) - 1, 0, -1)]
# 14.2 s ± 373 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# numpy approach by @ynotzort
%timeit np.diff(l)
# 8.52 s ± 301 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# zip approach by @Nick
%timeit [nxt - cur for cur, nxt in zip(l, l[1:])]
# 7.96 s ± 243 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# itertools and operator approach by @Chris
%timeit [l[0], *starmap(sub, zip(islice(l, 1, None), l))]
# 6.4 s ± 255 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You could use zip to put together the list with an offset version and subtract those values
a = [1, 2, 3, 4, 5]
a[1:] = [nxt - cur for cur, nxt in zip(a, a[1:])]
print(a)
Output:
[1, 1, 1, 1, 1]
Out of interest, I ran this, the original code, and @ynotzort's answer through timeit. This was much faster than the numpy code for short lists and remained faster up to about 10M values; both were about 30% faster than the original code. As the list size grows beyond 10M, the numpy code gains ground and is eventually faster from about 20M values onward.
Update
Also tested the starmap code, and that is about 40% faster than the numpy code at 20M values...
Update 2
@Chris has some more comprehensive performance data in their answer. This answer can be sped up further (about 10%) by using itertools.islice to generate the offset list:
a = [a[0], *[nxt - cur for cur, nxt in zip(a, islice(a, 1, None))]]
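As a self-contained sketch of the zip/islice idea, the encoding can be wrapped in a function together with its inverse, which makes it easy to check the round trip. The function names delta_encode and delta_decode are made up for illustration.

```python
from itertools import accumulate, islice

def delta_encode(values):
    """Keep the first value, then store each difference to the previous value."""
    if not values:
        return []
    # islice avoids building the shifted copy that values[1:] would create.
    diffs = [nxt - cur for cur, nxt in zip(values, islice(values, 1, None))]
    return [values[0], *diffs]

def delta_decode(deltas):
    """Undo delta_encode: a running sum restores the original values."""
    return list(accumulate(deltas))

original = [1, 2, 3, 4, 5]
encoded = delta_encode(original)
decoded = delta_decode(encoded)

print(encoded)  # [1, 1, 1, 1, 1]
print(decoded)  # [1, 2, 3, 4, 5]
```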
You could use numpy.diff. For example:
import numpy as np
a = [1, 2, 3, 4, 5]
npa = np.array(a)
a_diff = np.diff(npa)
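Note that np.diff on its own returns one element fewer than the input, so the first value is lost. A possible way to keep it, sketched here under the assumption of NumPy 1.16 or newer (where the prepend argument was added), is to prepend a zero. The names encoded and decoded are chosen for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5], dtype=np.int64)

# Prepending 0 makes the first difference equal to the first value itself.
encoded = np.diff(a, prepend=0)

# A cumulative sum reverses the encoding.
decoded = np.cumsum(encoded)

print(encoded)  # [1 1 1 1 1]
print(decoded)  # [1 2 3 4 5]
```

For very large inputs it probably pays to hold the data in a NumPy array from the start, because converting a Python list to an array takes time of its own.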
If I have this list
a = [1,0,0,1,0,0,0,1]
and I want to turn it into
a = [1,0,0,2,0,0,0,3]
Setup for solution #1 and #2
from itertools import count
to_add = count()  # stateful counter: re-create it before running each solution
a = [1,0,0,1,0,0,0,1]
Solution #1
>>> [x + next(to_add) if x else x for x in a]
[1, 0, 0, 2, 0, 0, 0, 3]
Solution #2, hacky but fun
>>> [x and x + next(to_add) for x in a]
[1, 0, 0, 2, 0, 0, 0, 3]
Setup for solution #3 and #4
import numpy as np
a = np.array([1,0,0,1,0,0,0,1])
Solution #3
>>> np.where(a == 0, 0, a.cumsum())
array([1, 0, 0, 2, 0, 0, 0, 3])
Solution #4 (my favorite one yet)
>>> a*a.cumsum()
array([1, 0, 0, 2, 0, 0, 0, 3])
All the cumsum solutions assume that the non-zero elements of a are all ones.
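If the nonzero entries are not all ones, one possible workaround is to take the cumulative sum of a boolean "is nonzero" mask instead of the values themselves. This sketch assumes that every nonzero entry should be replaced by its running position among the nonzero entries; the sample values are made up.

```python
import numpy as np

a = np.array([5, 0, 0, -2, 0, 0, 0, 7])

nonzero = a != 0                       # True at every nonzero position
running_count = np.cumsum(nonzero)     # booleans are summed as 0/1

# Keep zeros as they are, number the nonzero entries 1, 2, 3, ...
result = np.where(nonzero, running_count, 0)

print(result)  # [1 0 0 2 0 0 0 3]
```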
Timings:
# setup
>>> a = [1, 0, 0, 1, 0, 0, 0, 1]*1000
>>> arr = np.array(a)
>>> to_add1, to_add2 = count(), count()
# IPython timings on an i5-6200U CPU @ 2.30GHz (though only relative times are of interest)
>>> %timeit [x + next(to_add1) if x else x for x in a] # solution 1
669 µs ± 3.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit [x and x + next(to_add2) for x in a] # solution 2
673 µs ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit np.where(arr == 0, 0, arr.cumsum()) # solution 3
34.7 µs ± 94.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit arr = np.array(a); np.where(arr == 0, 0, arr.cumsum()) # solution 3 with array creation
474 µs ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit arr*arr.cumsum() # solution 4
23.6 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit arr = np.array(a); arr*arr.cumsum() # solution 4 with array creation
465 µs ± 6.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here is how I would do it:
def increase(l):
    count = 0
    for num in l:
        if num == 1:
            yield num + count
            count += 1
        else:
            yield num
c = list(increase(a))
c
[1, 0, 0, 2, 0, 0, 0, 3]
So, you want to increase each 1 except for the first one, right?
How about:
a = [1,0,0,1,0,0,0,1]
current_number = 0
for i, num in enumerate(a):
    if num == 1:
        a[i] = current_number + 1
        current_number += 1
print(a)
>>> [1, 0, 0, 2, 0, 0, 0, 3]
Or, if you prefer:
current_number = 1
for i, num in enumerate(a):
    if num == 1:
        a[i] = current_number
        current_number += 1
Use a list comprehension for this:
print([a[i]+a[:i].count(1) if a[i]==1 else a[i] for i in range(len(a))])
Output:
[1, 0, 0, 2, 0, 0, 0, 3]
Loop version:
result = list(a)  # write to a copy: updating a in place would change what a[:i].count(1) sees
for i in range(len(a)):
    if a[i]==1:
        result[i]=a[i]+a[:i].count(1)
Use numpy's cumsum (cumulative sum) to replace each 1 with the running count of 1's:
In [4]: import numpy as np
In [5]: [i if i == 0 else j for i, j in zip(a, np.cumsum(a))]
Out[5]: [1, 0, 0, 2, 0, 0, 0, 3]
Other option: a one liner list comprehension, no dependencies.
[ 0 if e == 0 else sum(a[:i+1]) for i, e in enumerate(a) ]
#=> [1, 0, 0, 2, 0, 0, 0, 3]
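The slicing one-liners above recompute a sum or count over a[:i] for every element, so their cost grows quadratically with the list length. A dependency-free sketch that makes a single pass uses itertools.accumulate for the running total; like the cumsum answers, it assumes the nonzero elements are all ones.

```python
from itertools import accumulate

a = [1, 0, 0, 1, 0, 0, 0, 1]

# accumulate(a) yields the running sum: 1, 1, 1, 2, 2, 2, 2, 3
result = [total if x else 0 for x, total in zip(a, accumulate(a))]

print(result)  # [1, 0, 0, 2, 0, 0, 0, 3]
```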
I have multiple numpy arrays and I want to create new arrays doing something that is like an XOR ... but not quite.
My input is two arrays, array1 and array2.
My output is a modified (or new array, I don't really care) version of array1.
The modification is elementwise, by doing the following:
1.) If either array has 0 for the given index, then the index is left unchanged.
2.) If array1 and array2 are both nonzero, then the modified array is assigned array1's value minus array2's value at that index, down to a minimum of zero.
Examples:
array1: [0, 3, 8, 0]
array2: [1, 1, 1, 1]
output: [0, 2, 7, 0]
array1: [1, 1, 1, 1]
array2: [0, 3, 8, 0]
output: [1, 0, 0, 1]
array1: [10, 10, 10, 10]
array2: [8, 12, 8, 12]
output: [2, 0, 2, 0]
I would like to be able to do this with say, a single numpy.copyto statement, but I don't know how. Thank you.
edit:
It just hit me. Could I do:
new_array = np.zeros(size_of_array1)
np.copyto(new_array, array1-array2, where=array1>array2)
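A runnable version of this idea might look like the sketch below. It uses np.zeros_like instead of the size_of_array1 placeholder so that the output has the same shape and dtype as array1, and the function name subtract_floor_zero is made up for illustration.

```python
import numpy as np

def subtract_floor_zero(array1, array2):
    """array1 - array2 where array1 > array2, and 0 everywhere else."""
    out = np.zeros_like(array1)
    # Only positions where the mask is True are overwritten.
    np.copyto(out, array1 - array2, where=array1 > array2)
    return out

print(subtract_floor_zero(np.array([0, 3, 8, 0]), np.array([1, 1, 1, 1])))          # [0 2 7 0]
print(subtract_floor_zero(np.array([1, 1, 1, 1]), np.array([0, 3, 8, 0])))          # [1 0 0 1]
print(subtract_floor_zero(np.array([10, 10, 10, 10]), np.array([8, 12, 8, 12])))    # [2 0 2 0]
```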
Edit 2: Since I have received several answers very quickly I am going to time the different answers against each other to see how they do. Be back with results in a few minutes.
Okay, results are in:
array of random ints 0 to 5, size = 10,000, 10 loops
1.) using my np.copyto method: 0.000768184661865
2.) using clip: 0.000391960144043
3.) using maximum: 0.000403165817261
Kasramvd also provided some useful timings below
You can use a simple subtraction and clipping the result with zero as the min:
(arr1 - arr2).clip(min=0)
Demo:
In [43]: arr1 = np.array([0,3,8,0]); arr2 = np.array([1,1,1,1])
In [44]: (arr1 - arr2).clip(min=0)
Out[44]: array([0, 2, 7, 0])
On large arrays it is also slightly faster than the maximum approach:
In [51]: arr1 = np.arange(10000); arr2 = np.arange(10000)
In [52]: %timeit np.maximum(0, arr1 - arr2)
22.3 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [53]: %timeit (arr1 - arr2).clip(min=0)
20.9 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [54]: arr1 = np.arange(100000); arr2 = np.arange(100000)
In [55]: %timeit np.maximum(0, arr1 - arr2)
671 µs ± 5.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [56]: %timeit (arr1 - arr2).clip(min=0)
648 µs ± 4.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that if it's possible for arr2 to have negative values you should consider using an abs function on arr2 to get the expected result:
(arr1 - abs(arr2)).clip(min=0)
In [73]: np.maximum(0,np.array([0,3,8,0])-np.array([1,1,1,1]))
Out[73]: array([0, 2, 7, 0])
This doesn't explicitly address
If either array has 0 for the given index, then the index is left unchanged.
but the results match for all examples:
In [74]: np.maximum(0,np.array([1,1,1,1])-np.array([0,3,8,0]))
Out[74]: array([1, 0, 0, 1])
In [75]: np.maximum(0,np.array([10,10,10,10])-np.array([8,12,8,12]))
Out[75]: array([2, 0, 2, 0])
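If the "left unchanged when either array has 0" rule needs to be spelled out, for example because the arrays may contain negative values, one way to sketch it is with np.where. The function name apply_rules and the last pair of sample arrays are assumptions added for illustration.

```python
import numpy as np

def apply_rules(array1, array2):
    """Subtract with a floor of zero, but only where both inputs are nonzero."""
    both_nonzero = (array1 != 0) & (array2 != 0)
    return np.where(both_nonzero, np.maximum(array1 - array2, 0), array1)

# Same results as np.maximum(0, array1 - array2) on the question's examples.
print(apply_rules(np.array([0, 3, 8, 0]), np.array([1, 1, 1, 1])))  # [0 2 7 0]
print(apply_rules(np.array([1, 1, 1, 1]), np.array([0, 3, 8, 0])))  # [1 0 0 1]

# The two differ once array2 holds a negative value at a zero of array1.
x = np.array([0, 5])
y = np.array([-3, 0])
print(apply_rules(x, y))       # [0 5]
print(np.maximum(0, x - y))    # [3 5]
```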
You can first simply subtract the arrays and then use boolean array indexing on the subtracted result to assign 0 where there are negative values as in:
# subtract
In [43]: subtracted = arr1 - arr2
# get a boolean mask by checking for < 0
# index into the array and assign 0
In [44]: subtracted[subtracted < 0] = 0
In [45]: subtracted
Out[45]: array([0, 2, 7, 0])
Applying the same for the other inputs specified by OP:
In [46]: arr1 = np.array([1, 1, 1, 1])
...: arr2 = np.array([0, 3, 8, 0])
In [47]: subtracted = arr1 - arr2
In [48]: subtracted[subtracted < 0] = 0
In [49]: subtracted
Out[49]: array([1, 0, 0, 1])
And for the third input arrays:
In [50]: arr1 = np.array([10, 10, 10, 10])
...: arr2 = np.array([8, 12, 8, 12])
In [51]: subtracted = arr1 - arr2
In [52]: subtracted[subtracted < 0] = 0
In [53]: subtracted
Out[53]: array([2, 0, 2, 0])