My goal is to take a list of np.arrays and create an associated list or array that classifies each as having a duplicate or not. Here's what I thought would work:
www = [np.array([1, 1, 1]), np.array([1, 1, 1]), np.array([2, 1, 1])]
uniques, counts = np.unique(www, axis = 0, return_counts = True)
counts = [1 if x > 1 else 0 for x in counts]
count_dict = dict(zip(uniques, counts))
[count_dict[i] for i in www]
The desired output for this case would be:
[1, 1, 0]
because the first and second elements have another copy within the original list. It seems that the problem is that I cannot use an np.array as a key for a dictionary.
Suggestions?
First convert www to a 2D NumPy array, then do the following:
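For example:

www = np.array(www)  # list of three length-3 arrays becomes a (3, 3) array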
In [18]: (counts[np.where((www[:,None] == uniques).all(2))[1]] > 1).astype(int)
Out[18]: array([1, 1, 0])
Here we use broadcasting to check the equality of all www rows against the uniques array, and then use all() on the last axis to find out which of its rows are completely equal to uniques rows.
Here's the elaborated results:
In [20]: (www[:,None] == uniques).all(2)
Out[20]:
array([[ True, False],
[ True, False],
[False, True]])
# Respective indices in `counts` array
In [21]: np.where((www[:,None] == uniques).all(2))[1]
Out[21]: array([0, 0, 1])
In [22]: counts[np.where((www[:,None] == uniques).all(2))[1]] > 1
Out[22]: array([ True, True, False])
In [23]: (counts[np.where((www[:,None] == uniques).all(2))[1]] > 1).astype(int)
Out[23]: array([1, 1, 0])
In Python, lists (and numpy arrays) cannot be hashed, so they can't be used as dictionary keys. But tuples can! So one option would be to convert your original list to a tuple, and to convert uniques to a tuple. The following works for me:
www = [np.array([1, 1, 1]), np.array([1, 1, 1]), np.array([2, 1, 1])]
www_tuples = [tuple(l) for l in www] # list of tuples
uniques, counts = np.unique(www, axis = 0, return_counts = True)
counts = [1 if x > 1 else 0 for x in counts]
# convert uniques to tuples
uniques_tuples = [tuple(l) for l in uniques]
count_dict = dict(zip(uniques_tuples, counts))
[count_dict[i] for i in www_tuples]
Just a heads-up: this will double your memory consumption, so it may not be the best solution if www is large.
You can mitigate the extra memory consumption by ingesting your data as tuples instead of numpy arrays if possible.
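For example, a minimal sketch, assuming your data arrives through some iterable of rows (source here is hypothetical):

www_tuples = [tuple(row) for row in source]  # hypothetical `source`; build tuples directly, no array copies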
Can someone explain how to append elements to a numpy array in a loop, based on a condition?
I wrote some code that should append the element 2 if the i-th element of array A is 0, and append the element 1 if it is not 0.
Here is the code itself:
import numpy as np

def finalconcat(somearray):
    for i in somearray:
        arraysome=[]
        if somearray[i]==0:
            arraysome=np.append(arraysome,[2],axis=0)
        else:
            arraysome=np.append(arraysome,[1],axis=0)
    return arraysome
Let me give you an example:
A=np.array([1,0,2,3,4,5])
C=finalconcat(B)
print(C)
It should come out:
[1,2,1,1,1,1]
But it comes out like:
[1.]
Please explain to me what is wrong here, I just don't understand what could be wrong...
You have several issues:
arraysome=[] is inside your loop, so on each iteration over somearray you are emptying arraysome. Consequently, you can never end up with more than one element in arraysome when you are all done.
You have for i in somearray. On each iteration i will be the next element of somearray; it will not be iterating indices of the array. Yet later you have if somearray[i]==0:. This should just be if i==0:.
If you want the resulting elements of arraysome to be integers rather than floats, then you should initialize it to be a numpy array of integers.
You have C=finalconcat(B), but B is not defined.
You should really spend some time reading the PEP 8 – Style Guide for Python Code.
import numpy as np

def finalconcat(somearray):
    arraysome = np.array([], dtype=int)
    for i in somearray:
        if i == 0:
            arraysome = np.append(arraysome, [2], axis=0)
        else:
            arraysome = np.append(arraysome, [1], axis=0)
    return arraysome
a = np.array([1, 0, 2, 3, 4, 5])
c = finalconcat(a)
print(c)
Prints:
[1 2 1 1 1 1]
For iteration like this it's better to use lists. np.append is just a poorly named cover for np.concatenate, which returns a whole new array with each call. List append works in-place, and is more efficient. And easier to use:
def finalconcat(somearray):
    rec = [2 if i==0 else 1 for i in somearray]
    # arr = np.array(rec)
    return rec
In [31]: a = np.array([1, 0, 2, 3, 4, 5])
In [32]: np.array([2 if i==0 else 1 for i in a])
Out[32]: array([1, 2, 1, 1, 1, 1])
But it's better to use whole-array methods, such as:
In [33]: b = np.ones_like(a)
In [34]: b
Out[34]: array([1, 1, 1, 1, 1, 1])
In [35]: b[a==0] = 2
In [36]: b
Out[36]: array([1, 2, 1, 1, 1, 1])
or
In [37]: np.where(a==0, 2, 1)
Out[37]: array([1, 2, 1, 1, 1, 1])
Input:
dt = [6,7,8,9,10]
data = [1,2,3,4,5]
b = 8.0
b = np.require(b, dtype=np.float)
data += dt < b
data
Output:
array([2, 3, 3, 4, 5])
I tried inputting different numbers but still couldn't figure out what the "<" is doing there.
Also, it seems to work only when b is np.float (hence the conversion).
The < with numpy arrays does an element-wise comparison: it returns an array containing True where the condition holds and False where it doesn't. The np.require line is necessary here so that a NumPy array is actually involved; you could drop it if you converted data and dt to np.arrays beforehand.
Then the result is added (element-wise) to the numeric array. In this context True is equal to 1 and False to 0.
>>> dt < b # which elements are smaller than b?
array([ True, True, False, False, False])
>>> 0 + (dt < b) # boolean arrays in arithmetic operations with numbers
array([1, 1, 0, 0, 0])
So it adds 1 to every element of data where the element in dt is smaller than 8.
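For instance, converting both inputs to arrays up front makes the np.require call unnecessary:

>>> import numpy as np
>>> dt = np.array([6, 7, 8, 9, 10])
>>> data = np.array([1, 2, 3, 4, 5])
>>> data += dt < 8.0  # boolean mask is coerced to 0/1 for the addition
>>> data
array([2, 3, 3, 4, 5])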
dt is a list:
In [50]: dt = [6,7,8,9,10]
In [51]: dt < 8
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-51-3d06f93227f5> in <module>()
----> 1 dt < 8
TypeError: '<' not supported between instances of 'list' and 'int'
< (.__lt__) is not defined between a list and an int.
But if one side of the comparison is an ndarray, then the numpy definition of __lt__ applies: dt is turned into an array, and the comparison is done element by element.
In [52]: dt < np.array(8)
Out[52]: array([ True, True, False, False, False])
In [53]: np.array(dt) < 8
Out[53]: array([ True, True, False, False, False])
numpy array operations also explain the data += part:
In [54]: data = [1,2,3,4,5] # a list
In [55]: data + (dt < np.array(8)) # list=>array, and boolean array to integer array
Out[55]: array([2, 3, 3, 4, 5])
In [56]: data
Out[56]: [1, 2, 3, 4, 5]
In [57]: data += (dt < np.array(8))
In [58]: data
Out[58]: array([2, 3, 3, 4, 5])
Actually I'm a bit surprised that with the +=, data has been changed from a list to an array. It means the data += ... has been implemented as an assignment:
data = data + (dt <np.array(8))
Normally + for a list is a concatenate:
In [61]: data += ['a','b','c']
In [62]: data
Out[62]: [1, 2, 3, 4, 5, 'a', 'b', 'c']
# equivalent of: data.extend(['a','b','c'])
You can often get away with using lists in array contexts, but it's better to make the objects arrays, so you don't get these implicit, and sometimes unexpected, conversions.
This is just an alias (or shortcut, or convenience notation) for the equivalent function numpy.less():
In [116]: arr1 = np.arange(8)
In [117]: scalar = 6.0
# comparison that generates a boolean mask
In [118]: arr1 < scalar
Out[118]: array([ True, True, True, True, True, True, False, False])
# same operation as above
In [119]: np.less(arr1, scalar)
Out[119]: array([ True, True, True, True, True, True, False, False])
Let's see how this boolean array can be added to a non-boolean array in this case. It is possible due to type coercion:
# sample array
In [120]: some_arr = np.array([1, 1, 1, 1, 1, 1, 1, 1])
# addition after type coercion
In [122]: some_arr + (arr1 < scalar)
Out[122]: array([2, 2, 2, 2, 2, 2, 1, 1])
# same output achieved with `numpy.less()`
In [123]: some_arr + np.less(arr1, scalar)
Out[123]: array([2, 2, 2, 2, 2, 2, 1, 1])
So, type coercion happens on the boolean array and then addition is performed.
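To make that coercion explicit, you can cast the boolean mask yourself:

In [124]: (arr1 < scalar).astype(int)
Out[124]: array([1, 1, 1, 1, 1, 1, 0, 0])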
I have a signal where I want to find the average height of the values. This is done by finding the zero crossings and calculating the max and min between each zero crossing, then averaging these values.
My problem occurs when I want to use np.where() to find where the signal is crossing zero. When I use np.where() I get the result in a tuple, but I want it in an array where I can count the amount of times zero is crossed.
I am new to Python, and coming from Matlab it is a bit confusing with all the different classes. As you can see, I get an error because nu = len(zero_u) gives 1 as a result: the whole array is wrapped in a tuple as one element.
Any ideas how to go around this?
The code looks like this:
import numpy as np

def averageheight(f):
    rms = np.std(f)
    f = f + (rms * 10**-6)
    # Find zero crossing
    fsign = np.sign(f)
    fdiff = np.diff(fsign)
    zero_u = np.asarray(np.where(fdiff > 0)) + 1
    zero_d = np.asarray(np.where(fdiff < 0)) + 1
    nu = len(zero_u)
    nd = len(zero_d)
    value_max = np.zeros((nu, 1))
    value_min = np.zeros((nu, 1))
    imaxvec = np.zeros((nu, 1))
    iminvec = np.zeros((nu, 1))
    if (nu > 2) and (nd > 2):
        if zero_u[0] > zero_d[0]:
            zero_d[0] = []
        nu = len(zero_u)
        nd = len(zero_d)
        ncross = np.fmin(nu, nd)
        # Find Maxima:
        for ic in range(0, ncross - 1):
            up = int(zero_u[ic])
            down = int(zero_d[ic])
            fvec = f[up:down]
            value_max[ic] = np.amax(fvec)
            index_max = value_max.argmax()
            imaxvec[ic] = up + index_max - 1
        # Find Minima:
        for ic in range(0, ncross - 2):
            down = int(zero_d[ic])
            up = int(zero_u[ic+1])
            fvec = f[down:up]
            value_min[ic] = np.amin(fvec)
            index_min = value_min.argmin()
            iminvec[ic] = down + index_min - 1
        # Remove spurious values, bumps and zero_d
        thr = rms/3
        maxfind = np.where(value_max < thr)
        for i in range(0, len(maxfind)):
            imaxfind = np.where(value_max == maxfind[i])
            imaxvec[imaxfind] = 0
            value_max[imaxfind] = 0
        minfind = np.where(value_min > -thr)
        for j in range(0, len(minfind)):
            iminfind = np.where(value_min == minfind[j])
            value_min[iminfind] = 0
            iminvec[iminfind] = 0
        # Find Average Height
        avh = np.mean(value_max) - np.mean(value_min)
    else:
        avh = 0
    return avh
The documentation for np.where, and even more so for np.nonzero, clearly explains that it returns a tuple, with one array for each dimension of the condition array:
In [71]: arr = np.random.randint(-5,5,10)
In [72]: arr
Out[72]: array([ 3, 4, 2, -3, -1, 0, -5, 4, 2, -3])
In [73]: arr.shape
Out[73]: (10,)
In [74]: np.where(arr>=0)
Out[74]: (array([0, 1, 2, 5, 7, 8]),)
In [75]: arr[_]
Out[75]: array([3, 4, 2, 0, 4, 2])
That Out[74] tuple can be used directly as an index.
You can also extract the array from the tuple:
In [76]: np.where(arr>=0)[0]
Out[76]: array([0, 1, 2, 5, 7, 8])
That, I think, is a better choice than the np.asarray(np.where(...)).
This convention for where becomes clearer when we use it on a 2D array:
In [77]: arr2 = arr.reshape(2,5)
In [78]: np.where(arr2>=0)
Out[78]: (array([0, 0, 0, 1, 1, 1]), array([0, 1, 2, 0, 2, 3]))
In [79]: arr2[_]
Out[79]: array([3, 4, 2, 0, 4, 2])
Again we are indexing with a tuple. arr2[1,3] is really arr2[(1,3)]. The values in [] indexing brackets are actually passed to the indexing function as a tuple of values.
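You can check the equivalence directly:

>>> arr2[1, 3] == arr2[(1, 3)]
True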
np.argwhere applies transpose to the result of where, producing an array:
In [80]: np.transpose(np.where(arr2>=0))
Out[80]:
array([[0, 0],
[0, 1],
[0, 2],
[1, 0],
[1, 2],
[1, 3]])
Those are the same indexing arrays, just arranged as the columns of a 2D array.
If you only need the count, not the actual indices, a slightly faster function is
In [81]: np.count_nonzero(arr>=0)
Out[81]: 6
In fact np.nonzero uses the count to first determine the size of the arrays that it will return.
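Applied to the question's code, the fix is a sketch like this (same variable names; only the np.where lines change):

fsign = np.sign(f)
fdiff = np.diff(fsign)
zero_u = np.where(fdiff > 0)[0] + 1  # 1D array of upward-crossing indices
zero_d = np.where(fdiff < 0)[0] + 1
nu = len(zero_u)  # now the number of crossings, not 1
nd = len(zero_d)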
Let's say I have a simple array:
a = np.arange(3)
And an array of indices with the same length:
I = np.array([0, 0, 1])
I now want to group the values based on the indices.
How would I group the elements of the first array to produce the result below?
np.array([[0, 1], [2]], dtype=object)
Here is what I tried:
a = np.arange(3)
I = np.array([0, 0, 1])
out = np.empty(2, dtype=object)
out.fill([])
aslists = np.vectorize(lambda x: [x], otypes=['object'])
out[I] += aslists(a)
However, this approach does not concatenate the lists, but only maintains the last value for each index:
array([[1], [2]], dtype=object)
Or, for a 2-dimensional case:
a = np.random.rand(100)
I = (np.random.random(100) * 5 //1).astype(int)
J = (np.random.random(100) * 5 //1).astype(int)
out = np.empty((5, 5), dtype=object)
out.fill([])
How can I append the items from a to out based on the two index arrays?
1D Case
Assuming I is sorted, for a list of arrays as output -
idx = np.unique(I, return_index=True)[1]
out = np.split(a,idx)[1:]
Another option, with slicing to get idx for splitting a -
out = np.split(a, np.flatnonzero(I[1:] != I[:-1])+1)
To get an array of lists as output -
np.array([i.tolist() for i in out])
Sample run -
In [84]: a = np.arange(3)
In [85]: I = np.array([0, 0, 1])
In [86]: out = np.split(a, np.flatnonzero(I[1:] != I[:-1])+1)
In [87]: out
Out[87]: [array([0, 1]), array([2])]
In [88]: np.array([i.tolist() for i in out])
Out[88]: array([[0, 1], [2]], dtype=object)
2D Case
For the 2D case, where a 2D array is filled with groupings made from indices in two arrays I and J (the rows and columns where the groups are to be assigned), we could do something like this -
ncols = 5
lidx = I*ncols+J
sidx = lidx.argsort() # Use kind='mergesort' to keep order
lidx_sorted = lidx[sidx]
unq_idx, split_idx = np.unique(lidx_sorted, return_index=True)
out.flat[unq_idx] = np.split(a[sidx], split_idx)[1:]
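Here is a small end-to-end sketch of the 2D case (the values and index arrays are made up for illustration; the final loop is a defensive variant of the out.flat assignment, since newer NumPy versions are strict about coercing ragged lists of arrays):

a = np.array([10, 20, 30, 40])
I = np.array([0, 1, 0, 1])
J = np.array([2, 2, 2, 0])
out = np.empty((5, 5), dtype=object)
out.fill([])
ncols = 5
lidx = I*ncols + J
sidx = lidx.argsort(kind='mergesort')  # stable sort keeps original order
lidx_sorted = lidx[sidx]
unq_idx, split_idx = np.unique(lidx_sorted, return_index=True)
for k, grp in zip(unq_idx, np.split(a[sidx], split_idx)[1:]):
    out.flat[k] = grp  # store each group object in its grid cell
# out[0, 2] -> array([10, 30]); out[1, 0] -> array([40]); out[1, 2] -> array([20])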
How can I convert the duplicate elements in an array 'data' into 0? It has to be done row-wise.
data = np.array([[1,8,3,3,4],
[1,8,9,9,4]])
The answer should be as follows:
ans = array([[1,8,3,0,4],
[1,8,9,0,4]])
Approach #1
One approach with np.unique -
# Find out the unique elements and their starting positions
unq_data, idx = np.unique(data,return_index=True)
# Find out the positions for each unique element, their duplicate positions
dup_idx = np.setdiff1d(np.arange(data.size),idx)
# Set those duplicate-positioned elements to 0s
data[dup_idx] = 0
Sample run -
In [46]: data
Out[46]: array([1, 8, 3, 3, 4, 1, 3, 3, 9, 4])
In [47]: unq_data, idx = np.unique(data,return_index=True)
...: dup_idx = np.setdiff1d(np.arange(data.size),idx)
...: data[dup_idx] = 0
...:
In [48]: data
Out[48]: array([1, 8, 3, 0, 4, 0, 0, 0, 9, 0])
Approach #2
You can also use sorting and differencing as a faster approach -
# Get indices for sorted data
sort_idx = np.argsort(data)
# Get duplicate indices and set those in data to 0s
dup_idx = sort_idx[1::][np.diff(np.sort(data))==0]
data[dup_idx] = 0
Runtime tests -
In [110]: data = np.random.randint(0,100,(10000))
...: data1 = data.copy()
...: data2 = data.copy()
...:
In [111]: def func1(data):
...: unq_data, idx = np.unique(data,return_index=True)
...: dup_idx = np.setdiff1d(np.arange(data.size),idx)
...: data[dup_idx] = 0
...:
...: def func2(data):
...: sort_idx = np.argsort(data)
...: dup_idx = sort_idx[1::][np.diff(np.sort(data))==0]
...: data[dup_idx] = 0
...:
In [112]: %timeit func1(data1)
1000 loops, best of 3: 1.36 ms per loop
In [113]: %timeit func2(data2)
1000 loops, best of 3: 467 µs per loop
Extending to a 2D case:
Approach #2 could be extended to work for a 2D array case, avoiding any loop like so -
# Get indices for sorted data
sort_idx = np.argsort(data,axis=1)
# Get sorted linear indices
row_offset = data.shape[1]*np.arange(data.shape[0])[:,None]
sort_lin_idx = sort_idx[:,1::] + row_offset
# Get duplicate linear indices and set those in data as 0s
dup_lin_idx = sort_lin_idx[np.diff(np.sort(data,axis=1),axis=1)==0]
data.ravel()[dup_lin_idx] = 0
Sample run -
In [6]: data
Out[6]:
array([[1, 8, 3, 3, 4, 0, 3, 3],
[1, 8, 9, 9, 4, 8, 7, 9],
[1, 8, 9, 9, 4, 8, 7, 3]])
In [7]: sort_idx = np.argsort(data,axis=1)
...: row_offset = data.shape[1]*np.arange(data.shape[0])[:,None]
...: sort_lin_idx = sort_idx[:,1::] + row_offset
...: dup_lin_idx = sort_lin_idx[np.diff(np.sort(data,axis=1),axis=1)==0]
...: data.ravel()[dup_lin_idx] = 0
...:
In [8]: data
Out[8]:
array([[1, 8, 3, 0, 4, 0, 0, 0],
[1, 8, 9, 0, 4, 0, 7, 0],
[1, 8, 9, 0, 4, 0, 7, 3]])
Here's a simple pure-Python way to do it:
seen = set()
for i, x in enumerate(data):
    if x in seen:
        data[i] = 0
    else:
        seen.add(x)
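Since the question asks for row-wise handling of a 2D array, the same idea can be applied one row at a time, e.g.:

for row in data:
    seen = set()
    for i, x in enumerate(row):
        if x in seen:
            row[i] = 0  # rows are views, so this updates data in place
        else:
            seen.add(x)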
You could use a nested for loop, where you compare each element of the array to every element after it to check for duplicate records, zeroing the later occurrence so the first one is kept:

for x in range(0, len(data)):
    for y in range(x+1, len(data)):
        if data[x] == data[y]:
            data[y] = 0
@Divakar has it almost right, but there are a few things that can be further optimized that don't really fit in a comment. To begin:
rows, cols = data.shape
The first operation is to sort the array to identify the duplicates. Since we will want to undo the sorting, we need to use np.argsort, but if you want to make sure that it is the first occurrence of each repeated value that is kept, you need to use a stable sorting algorithm:
sort_idx = data.argsort(axis=1, kind='mergesort')
Once we have the indices to sort data, to get a sorted copy of the array it is faster to use the indices than to re-sort the array:
sorted_data = data[np.arange(rows)[:, None], sort_idx]
While the principle is similar to that of using np.diff, it is typically faster to use boolean operations. We want an array full of False where the first occurrences of each value happen, and True where the duplicates are:
sorted_mask = np.concatenate((np.zeros((rows, 1), dtype=bool),
                              sorted_data[:, :-1] == sorted_data[:, 1:]),
                             axis=1)
We now use that mask to set all the duplicates to zero:
sorted_data[sorted_mask] = 0
And we finally undo the sorting. To revert a permutation you can sort the indices that define it, i.e. you could do:
invert_idx = sort_idx.argsort(axis=1, kind='mergesort')
ans = sorted_data[np.arange(rows)[:, None], invert_idx]
But it is more efficient to use assignment, i.e.:
ans = np.empty_like(data)
ans[np.arange(rows)[:, None], sort_idx] = sorted_data
Putting it all together:
def zero_dups(data):
    rows, cols = data.shape
    sort_idx = data.argsort(axis=1, kind='mergesort')
    sorted_data = data[np.arange(rows)[:, None], sort_idx]
    sorted_mask = np.concatenate((np.zeros((rows, 1), dtype=bool),
                                  sorted_data[:, :-1] == sorted_data[:, 1:]),
                                 axis=1)
    sorted_data[sorted_mask] = 0
    ans = np.empty_like(data)
    ans[np.arange(rows)[:, None], sort_idx] = sorted_data
    return ans
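A quick check against the sample from the question:

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4]])
print(zero_dups(data))
# [[1 8 3 0 4]
#  [1 8 9 0 4]]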