Remove consecutive duplicates in a NumPy array - python

I would like to remove duplicates which follow each other, but not duplicates along the whole array. Also, I want to keep the ordering unchanged.
So if the input is [0 0 1 3 2 2 3 3] the output should be [0 1 3 2 3]
I found a way using itertools.groupby() but I am looking for a faster NumPy solution.

a[np.insert(np.diff(a).astype(np.bool), 0, True)]
Out[99]: array([0, 1, 3, 2, 3])
The general idea is to use diff to find the difference between two consecutive elements in the array. Then we only index those which give non-zero differences elements. But since the length of diff is shorter by 1. So before indexing, we need to insert the True to the beginning of the diff array.
Explanation:
In [100]: a
Out[100]: array([0, 0, 1, 3, 2, 2, 3, 3])
In [101]: diff = np.diff(a).astype(np.bool)
In [102]: diff
Out[102]: array([False, True, True, True, False, True, False], dtype=bool)
In [103]: idx = np.insert(diff, 0, True)
In [104]: idx
Out[104]: array([ True, False, True, True, True, False, True, False], dtype=bool)
In [105]: a[idx]
Out[105]: array([0, 1, 3, 2, 3])

For pure python wich also works with numpy arrays use this:
def modify(l):
last = None
for e in l:
if e != last:
yield e
last = e
pure = modify([0, 0, 1, 3, 2, 2, 3, 3])
import numpy
num = numpy.array(modify(numpy.array([0, 0, 1, 3, 2, 2, 3, 3])))
I don't know if there are any numpy functions wich would speed this up.

For NumPy version >= 1.16.0 you can use the prepend argument:
a[np.diff(a, prepend=np.nan).astype(bool)]

Related

Understanding numpy.where

I want to get first index of numpy array element which is greater than some specific element of that same array. I tried following:
>>> Q5=[[1,2,3],[4,5,6]]
>>> Q5 = np.array(Q5)
>>> Q5[0][Q5>Q5[0,0]]
array([2, 3])
>>> np.where(Q5[0]>Q5[0,0])
(array([1, 2], dtype=int32),)
>>> np.where(Q5[0]>Q5[0,0])[0][0]
1
Q1. Is above correct way to obtain first index of an element in Q5[0] greater than Q5[0,0]?
I am more concerned with np.where(Q5[0]>Q5[0,0]) returning tuple (array([1, 2], dtype=int32),) and hence requiring me to double index [0][0] at the end of np.where(Q5[0]>Q5[0,0])[0][0].
Q2. Why this return tuple, but below returns proper numpy array?
>>> np.where(Q5[0]>Q5[0,0],Q5[0],-1)
array([-1, 2, 3])
So that I can index directly:
>>> np.where(Q5[0]>Q5[0,0],Q5[0],-1)[1]
2
In [58]: A = np.arange(1,10).reshape(3,3)
In [59]: A.shape
Out[59]: (3, 3)
In [60]: A
Out[60]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
np.where with just the condition is really np.nonzero.
Generate a boolean array:
In [63]: A==6
Out[63]:
array([[False, False, False],
[False, False, True],
[False, False, False]])
Find where that is true:
In [64]: np.nonzero(A==6)
Out[64]: (array([1]), array([2]))
The result is a tuple, one element per dimension of the condition. Each element is an indexing array, together they define the location of the True(s)
Another test with several True
In [65]: (A%3)==1
Out[65]:
array([[ True, False, False],
[ True, False, False],
[ True, False, False]])
In [66]: np.nonzero((A%3)==1)
Out[66]: (array([0, 1, 2]), array([0, 0, 0]))
Using the tuple to index the original array:
In [67]: A[np.nonzero((A%3)==1)]
Out[67]: array([1, 4, 7])
Using the 3 argument where to create a new array with a mix of values from A and A+10
In [68]: np.where((A%3)==1,A+10, A)
Out[68]:
array([[11, 2, 3],
[14, 5, 6],
[17, 8, 9]])
If the condition has multiple True, nonzero isn't the test tool for finding the "first", since it necessarily finds all.
The nonzero tuple can be turned into a 2d array with a transpose. It actually may be easier to get the "first" from this array:
In [73]: np.argwhere((A%3)==1)
Out[73]:
array([[0, 0],
[1, 0],
[2, 0]])
You are looking in a 1d array, a row of A:
In [77]: A[0]>A[0,0]
Out[77]: array([False, True, True])
In [78]: np.nonzero(A[0]>A[0,0])
Out[78]: (array([1, 2]),) # 1 element tuple
In [79]: np.argwhere(A[0]>A[0,0])
Out[79]:
array([[1],
[2]])
In [81]: np.where(A[0]>A[0,0], 100, 0) # 3 argument where
Out[81]: array([ 0, 100, 100])
So whether you are searching a 1d array or a 2d (or 3 or 4), nonzero returns a tuple with one array element per dimension. That way it can always be used to index a like sized array. The 1d tuple might look redundant, but it is consistent with other dimensional results.
When trying understand operations like this, read the docs carefully, and look at individual steps. Here I look at the conditional matrix, the nonzero result, and its various uses.
Using argmax with a boolean array will give you the index of the first True.
In [54]: q
Out[54]:
array([[1, 2, 3],
[4, 5, 6]])
In [55]: q > q[0,0]
Out[55]:
array([[False, True, True],
[ True, True, True]], dtype=bool)
argmax can take an axis/dimension argument.
In [56]: np.argmax(q > q[0,0], 0)
Out[56]: array([1, 0, 0], dtype=int64)
That says the first True is index one for column zero and index zero for columns one and two.
In [57]: np.argmax(q > q[0,0], 1)
Out[57]: array([1, 0], dtype=int64)
That says the first True is index one for row zero and index zero for row one.
Q1. Is above correct way to obtain first index of an element in Q5[0] greater than Q5[0,0]?
No I would use argmax with 1 for the axis argument then select the first item from that result.
Q2. Why this return tuple
You told it to return -1 for False values and return Q5[0] items for True values.
Q2 ...but below returns proper numpy array?
You got lucky and chose the correct index.
numpy.where() is like a for loop with an if.
numpy.where(condition, values, new_value)
condition - just like if conditions.
values - The values to iterate on
new_value - if the condition is true for a value, its going to change to the new_value
If we would like to write it for a 1-dimensional array it should look something like this:
[xv if c else yv for c, xv, yv in zip(condition, x, y)]
Example:
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.where(a < 5, a, 10*a)
array([ 0, 1, 2, 3, 4, 50, 60, 70, 80, 90])
First we create an array with numbers from 0 to 9 (0, 1, 2 ... 7, 8, 9)
and then we are checking for all the values in the array that are greater from 5 and multiplying their value by 10.
So now all the values in the array that are less then 5 stayed the same and all the values that are greater multiplied by 10

How to use conditional statements or loops to achieve my requirement?

I am new to python coding. Kindly, help me to achieve my requirement.
Suppose there are two arrays 'a' and 'b' of size 3*4
a = [[1,0,0,1],
[0,0,1,1],
[1,0,0,1]]
b = [[12,-34,-10,4],
[2,11,-12,20],
[-12,16,19,-9]]
Here, if b[i,j]<10 than I want the corresponding a[i,j] to be same(i.e it can be either 0 or 1) else change a[i,j] element to 1.
Expected outcome for the above example :
c = [[1,0,0,1],
[0,1,1,1],
[1,1,1,1]]
You can use the or | operator:
In [11]: b >= 10
Out[11]:
array([[ True, False, False, False],
[False, True, False, True],
[False, True, True, False]])
In [12]: a | (b >= 10)
Out[12]:
array([[1, 0, 0, 1],
[0, 1, 1, 1],
[1, 1, 1, 1]])
The | is a bitwise or and is equivalent to np.bitwise_or:
In [13]: np.bitwise_or(a, b >= 10)
Out[13]:
array([[1, 0, 0, 1],
[0, 1, 1, 1],
[1, 1, 1, 1]])
This assumes both a and b are numpy arrays, you can make this so with the array constructor:
a, b = np.array(a), np.array(b)
if you do not want to use numpy you could do this nested list-comprehension:
c = [[el_a | (el_b >= 10) for el_a, el_b in zip(row_a, row_b)]
for row_a, row_b in zip(a, b)]
but i prefer Andy Hayden's anser. numpy really shines for that kind of operations.

2d numpy mask not working as expected

I'm trying to turn a 2x3 numpy array into a 2x2 array by removing select indexes.
I think I can do this with a mask array with true/false values.
Given
[ 1, 2, 3],
[ 4, 1, 6]
I want to remove one element from each row to give me:
[ 2, 3],
[ 4, 6]
However this method isn't working quite like I would expect:
import numpy as np
in_array = np.array([
[ 1, 2, 3],
[ 4, 1, 6]
])
mask = np.array([
[False, True, True],
[True, False, True]
])
print in_array[mask]
Gives me:
[2 3 4 6]
Which is not what I want. Any ideas?
The only thing 'wrong' with that is it is the shape - 1d rather than 2. But what if your mask was
mask = np.array([
[False, True, False],
[True, False, True]
])
1 value in the first row, 2 in second. It couldn't return that as a 2d array, could it?
So the default behavior when masking like this is to return a 1d, or raveled result.
Boolean indexing like this is effectively a where indexing:
In [19]: np.where(mask)
Out[19]: (array([0, 0, 1, 1], dtype=int32), array([1, 2, 0, 2], dtype=int32))
In [20]: in_array[_]
Out[20]: array([2, 3, 4, 6])
It finds the elements of the mask which are true, and then selects the corresponding elements of the in_array.
Maybe the transpose of where is easier to visualize:
In [21]: np.argwhere(mask)
Out[21]:
array([[0, 1],
[0, 2],
[1, 0],
[1, 2]], dtype=int32)
and indexing iteratively:
In [23]: for ij in np.argwhere(mask):
...: print(in_array[tuple(ij)])
...:
2
3
4
6

How can you turn an index array into a mask array in Numpy?

Is it possible to convert an array of indices to an array of ones and zeros, given the range?
i.e. [2,3] -> [0, 0, 1, 1, 0], in range of 5
I'm trying to automate something like this:
>>> index_array = np.arange(200,300)
array([200, 201, ... , 299])
>>> mask_array = ??? # some function of index_array and 500
array([0, 0, 0, ..., 1, 1, 1, ... , 0, 0, 0])
>>> train(data[mask_array]) # trains with 200~299
>>> predict(data[~mask_array]) # predicts with 0~199, 300~499
Here's one way:
In [1]: index_array = np.array([3, 4, 7, 9])
In [2]: n = 15
In [3]: mask_array = np.zeros(n, dtype=int)
In [4]: mask_array[index_array] = 1
In [5]: mask_array
Out[5]: array([0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0])
If the mask is always a range, you can eliminate index_array, and assign 1 to a slice:
In [6]: mask_array = np.zeros(n, dtype=int)
In [7]: mask_array[5:10] = 1
In [8]: mask_array
Out[8]: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
If you want an array of boolean values instead of integers, change the dtype of mask_array when it is created:
In [11]: mask_array = np.zeros(n, dtype=bool)
In [12]: mask_array
Out[12]:
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False], dtype=bool)
In [13]: mask_array[5:10] = True
In [14]: mask_array
Out[14]:
array([False, False, False, False, False, True, True, True, True,
True, False, False, False, False, False], dtype=bool)
For a single dimension, try:
n = (15,)
index_array = [2, 5, 7]
mask_array = numpy.zeros(n)
mask_array[index_array] = 1
For more than one dimension, convert your n-dimensional indices into one-dimensional ones, then use ravel:
n = (15, 15)
index_array = [[1, 4, 6], [10, 11, 2]] # you may need to transpose your indices!
mask_array = numpy.zeros(n)
flat_index_array = np.ravel_multi_index(
index_array,
mask_array.shape)
numpy.ravel(mask_array)[flat_index_array] = 1
There's a nice trick to do this as a one-liner, too - use the numpy.in1d and numpy.arange functions like this (the final line is the key part):
>>> x = np.linspace(-2, 2, 10)
>>> y = x**2 - 1
>>> idxs = np.where(y<0)
>>> np.in1d(np.arange(len(x)), idxs)
array([False, False, False, True, True, True, True, False, False, False], dtype=bool)
The downside of this approach is that it's ~10-100x slower than the appropch Warren Weckesser gave... but it's a one-liner, which may or may not be what you're looking for.
As requested, here it is in an answer. The code:
[x in index_array for x in range(500)]
will give you a mask like you asked for, but it will use Bools instead of 0's and 1's.

How to count the number of true elements in a NumPy bool array

I have a NumPy array 'boolarr' of boolean type. I want to count the number of elements whose values are True. Is there a NumPy or Python routine dedicated for this task? Or, do I need to iterate over the elements in my script?
You have multiple options. Two options are the following.
boolarr.sum()
numpy.count_nonzero(boolarr)
Here's an example:
>>> import numpy as np
>>> boolarr = np.array([[0, 0, 1], [1, 0, 1], [1, 0, 1]], dtype=np.bool)
>>> boolarr
array([[False, False, True],
[ True, False, True],
[ True, False, True]], dtype=bool)
>>> boolarr.sum()
5
Of course, that is a bool-specific answer. More generally, you can use numpy.count_nonzero.
>>> np.count_nonzero(boolarr)
5
That question solved a quite similar question for me and I thought I should share :
In raw python you can use sum() to count True values in a list :
>>> sum([True,True,True,False,False])
3
But this won't work :
>>> sum([[False, False, True], [True, False, True]])
TypeError...
In terms of comparing two numpy arrays and counting the number of matches (e.g. correct class prediction in machine learning), I found the below example for two dimensions useful:
import numpy as np
result = np.random.randint(3,size=(5,2)) # 5x2 random integer array
target = np.random.randint(3,size=(5,2)) # 5x2 random integer array
res = np.equal(result,target)
print result
print target
print np.sum(res[:,0])
print np.sum(res[:,1])
which can be extended to D dimensions.
The results are:
Prediction:
[[1 2]
[2 0]
[2 0]
[1 2]
[1 2]]
Target:
[[0 1]
[1 0]
[2 0]
[0 0]
[2 1]]
Count of correct prediction for D=1: 1
Count of correct prediction for D=2: 2
boolarr.sum(axis=1 or axis=0)
axis = 1 will output number of trues in a row and axis = 0 will count number of trues in columns
so
boolarr[[true,true,true],[false,false,true]]
print(boolarr.sum(axis=1))
will be
(3,1)
b[b].size
where b is the Boolean ndarray in question. It filters b for True, and then count the length of the filtered array.
This probably isn't as efficient np.count_nonzero() mentioned previously, but is useful if you forget the other syntax. Plus, this shorter syntax saves programmer time.
Demo:
In [1]: a = np.array([0,1,3])
In [2]: a
Out[2]: array([0, 1, 3])
In [3]: a[a>=1].size
Out[3]: 2
In [5]: b=a>=1
In [6]: b
Out[6]: array([False, True, True])
In [7]: b[b].size
Out[7]: 2
For 1D array, this is what worked for me:
import numpy as np
numbers= np.array([3, 1, 5, 2, 5, 1, 1, 5, 1, 4, 2, 1, 4, 5, 3, 4,
5, 2, 4, 2, 6, 6, 3, 6, 2, 3, 5, 6, 5])
numbersGreaterThan2= np.count_nonzero(numbers> 2)

Categories