Remove rows by duplicate column(s) values


I have a large dataset in a numpy.ndarray similar to this:
array([[ -4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ -4, 5, 9, 98, -21, 80],
[ 5, -9, 0, 32, 18, 0]])
I would like to remove duplicate rows, where the 0th, 1st, 2nd and 5th columns are equal. I.e. On the above matrix, the response would be:
-4, 5, 9, 30, 50, 80
2, -6, 9, 34, 12, 7
5, -9, 0, 32, 18, 0
numpy.unique does something very similar, but it only finds duplicates across all columns (along an axis). I only want to compare specific columns. How can this be done with numpy? I could not find a suitable numpy function for it. Is there a better module?

Use np.unique on the sliced array with return_index=True and axis=0. Treating each row as one entity, this returns the index of the first occurrence of each unique row. These indices can then be used for row-indexing into the original array to get the desired output.
So, with a as the input array, it would be -
a[np.unique(a[:,[0,1,2,5]],return_index=True,axis=0)[1]]
Sample run to break down the steps and hopefully make things clear -
In [29]: a
Out[29]:
array([[ -4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ -4, 5, 9, 98, -21, 80],
[ 5, -9, 0, 32, 18, 0]])
In [30]: a_slice = a[:,[0,1,2,5]]
In [31]: _, unq_row_indices = np.unique(a_slice,return_index=True,axis=0)
In [32]: final_output = a[unq_row_indices]
In [33]: final_output
Out[33]:
array([[-4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ 5, -9, 0, 32, 18, 0]])
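One caveat with the approach above: np.unique returns its results sorted by the key columns, so the returned indices are not necessarily in input order. The following is a minimal sketch, assuming you want to keep the first occurrence of each key in its original position. The helper name drop_duplicate_rows is made up for illustration and is not part of NumPy.

```python
import numpy as np

def drop_duplicate_rows(a, cols):
    """Keep the first row for each distinct combination of values in `cols`."""
    # return_index gives, for each unique key, the index of its first occurrence
    _, first_idx = np.unique(a[:, cols], axis=0, return_index=True)
    # np.unique orders results by key; sorting the indices restores input order
    return a[np.sort(first_idx)]

a = np.array([[-4,  5, 9, 30,  50, 80],
              [ 2, -6, 9, 34,  12,  7],
              [-4,  5, 9, 98, -21, 80],
              [ 5, -9, 0, 32,  18,  0]])
result = drop_duplicate_rows(a, [0, 1, 2, 5])
print(result)
```

For the sample data the key order and the input order happen to coincide, so the extra np.sort only matters for other inputs.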

Pandas has functionality for this via pd.DataFrame.drop_duplicates. However, the convenient syntax comes at the cost of performance.
import pandas as pd
import numpy as np
A = np.array([[ -4, 5, 9, 30, 50, 80],
[ 2, -6, 9, 34, 12, 7],
[ -4, 5, 9, 98, -21, 80],
[ 5, -9, 0, 32, 18, 0]])
res = pd.DataFrame(A)\
.drop_duplicates(subset=[0, 1, 2, 5])\
.values
print(res)
[[-4  5  9 30 50 80]
 [ 2 -6  9 34 12  7]
 [ 5 -9  0 32 18  0]]

You can use the np.take method (https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.take.html) to get only the columns of the array that you care about, and then use the unique method with return_index=True.
>>> arr = np.array([[ -4, 5, 9, 30, 50, 80],
... [ 2, -6, 9, 34, 12, 7],
... [ -4, 5, 9, 98, -21, 80],
... [ 5, -9, 0, 32, 18, 0]])
>>> relevant_columns = np.take(arr, [0,1,2,5], axis=1)
>>> np.unique(relevant_columns, axis=0, return_index=True)
(array([[ 2, -6, 9, 7],
[ 5, -9, 0, 0],
[-4, 5, 9, 80]]), array([1, 3, 0]))
You can then use np.take() again on your original numpy array, this time with axis=0. Pass the returned index array (array([1, 3, 0]) above) as the indices.
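As a sketch of that final step (the variable names row_indices and deduplicated are illustrative only), the whole sequence might look like this. The exact order of the returned indices can differ between NumPy versions, so the example does not rely on it.

```python
import numpy as np

arr = np.array([[-4,  5, 9, 30,  50, 80],
                [ 2, -6, 9, 34,  12,  7],
                [-4,  5, 9, 98, -21, 80],
                [ 5, -9, 0, 32,  18,  0]])

# Select only the columns that define a duplicate
relevant_columns = np.take(arr, [0, 1, 2, 5], axis=1)
# Index of one representative row per unique key
_, row_indices = np.unique(relevant_columns, axis=0, return_index=True)
# axis=0 matters here: without it, np.take indexes into the flattened array
deduplicated = np.take(arr, row_indices, axis=0)
print(deduplicated)
```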

Related

Select non-consecutive row and column indices from 2d numpy array

I have an array a
a = np.arange(5*5).reshape(5,5)
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
and want to select the last two columns from row one and two, and the first two columns of row three and four.
The result should look like this
array([[3, 4, 10, 11],
[8, 9, 15, 16]])
How to do that in one go without indexing twice and concatenation?
I tried using take
a.take([[0,1,2,3], [3,4,0,1]])
array([[0, 1, 2, 3],
[3, 4, 0, 1]])
ix_
a[np.ix_([0,1,2,3], [3,4,0,1])]
array([[ 3, 4, 0, 1],
[ 8, 9, 5, 6],
[13, 14, 10, 11],
[18, 19, 15, 16]])
and r_
a[np.r_[0:2, 2:4], np.r_[3:5, 0:2]]
array([ 3, 9, 10, 16])
and a combination of ix_ and r_
a[np.ix_([0,1,2,3], np.r_[3:4, 0:1])]
array([[ 3, 0],
[ 8, 5],
[13, 10],
[18, 15]])

Using integer advanced indexing, you can do something like this
index_rows = np.array([
[0, 0, 2, 2],
[1, 1, 3, 3],
])
index_cols = np.array([
[-2, -1, 0, 1],
[-2, -1, 0, 1],
])
a[index_rows, index_cols]
where you just select directly what elements you want.
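Because the index arrays are broadcast against each other, they do not have to be written out in full. A small sketch of the same selection, assuming the 5x5 array from the question, with the row offsets and column indices built by broadcasting:

```python
import numpy as np

a = np.arange(5 * 5).reshape(5, 5)

# Shape (2, 1) + shape (4,) broadcasts to [[0, 0, 2, 2], [1, 1, 3, 3]]
index_rows = np.array([[0], [1]]) + np.array([0, 0, 2, 2])
# Shape (4,) is broadcast across both rows of index_rows
index_cols = np.array([-2, -1, 0, 1])

result = a[index_rows, index_cols]
print(result)
```

The result has the shape of the broadcast index arrays, here (2, 4), which is why no concatenation is needed.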

Finding max values from given subarrays using numpy's strides

I'm given such 2D-array.
My task is to find the max values of the subarrays painted in different colours. I have to use strides and as_strided. So far my code looks like this:
a=np.vstack(([0,1,2,3,4,5],[6,7,8,9,10,11],[12,13,14,15,16,17],[18,19,20,21,22,23]))
print(np.max(np.lib.stride_tricks.as_strided(a,(2,3),strides=(24,4))))
It correctly shows the maximum value of the first block, which is 8, but I have no idea how to move to the other parts of the matrix. Is there any way to do that, so I can show the max value of each subarray?
NOTE: This task is from my introductory Python programming class, so there is no need to write any sophisticated functions; I would even say it is inadvisable.

In [297]: a = np.arange(24).reshape(4,6)
In [298]: a
Out[298]:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
If we reshape size 4 dim to (2,2), and the 6 to (2,3):
In [299]: a.reshape(2,2,2,3)
Out[299]:
array([[[[ 0, 1, 2],
[ 3, 4, 5]],
[[ 6, 7, 8],
[ 9, 10, 11]]],
[[[12, 13, 14],
[15, 16, 17]],
[[18, 19, 20],
[21, 22, 23]]]])
The order of the blocks isn't right, but we can correct that with a transpose:
In [300]: a.reshape(2,2,2,3).transpose(0,2,1,3)
Out[300]:
array([[[[ 0, 1, 2],
[ 6, 7, 8]],
[[ 3, 4, 5],
[ 9, 10, 11]]],
[[[12, 13, 14],
[18, 19, 20]],
[[15, 16, 17],
[21, 22, 23]]]])
and then get the max of each of the 2d inner blocks:
In [301]: a.reshape(2,2,2,3).transpose(0,2,1,3).max((2,3))
Out[301]:
array([[ 8, 11],
[20, 23]])
a.reshape(2,2,2,3).max((1,3)) gets the same max.
OK, that wasn't done with strides, but it gives me ideas of how to use strides.
strides of a itself:
In [303]: a.strides
Out[303]: (48, 8)
after reshape:
In [304]: a.reshape(2,2,2,3).strides
Out[304]: (96, 48, 24, 8)
and after transpose:
In [305]: a.reshape(2,2,2,3).transpose(0,2,1,3).strides
Out[305]: (96, 24, 48, 8)
So we can use those strides directly:
In [313]: np.lib.stride_tricks.as_strided(a,(2,2,2,3),(96,24,48,8))
Out[313]:
array([[[[ 0, 1, 2],
[ 6, 7, 8]],
[[ 3, 4, 5],
[ 9, 10, 11]]],
[[[12, 13, 14],
[18, 19, 20]],
[[15, 16, 17],
[21, 22, 23]]]])
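The hard-coded byte strides above assume 8-byte integers. A hedged sketch that derives them from a.strides instead, and then takes the per-block maximum, could look like this (writeable=False is used only as a precaution, since as_strided views can alias memory):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(24).reshape(4, 6)
s0, s1 = a.strides  # bytes to step one row / one column

# Axes: (block row, block column, row inside block, column inside block)
blocks = as_strided(a, shape=(2, 2, 2, 3),
                    strides=(2 * s0, 3 * s1, s0, s1),
                    writeable=False)

# Reduce over the two inner axes to get one maximum per 2x3 block
block_max = blocks.max(axis=(2, 3))
print(block_max)
```

Deriving the strides from the array keeps the code correct if the dtype is int32 or float64 rather than int64.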

import numpy as np
from numpy.lib.stride_tricks import as_strided
matrix = np.arange(24).reshape(4, 6)
maxtrix = np.array([as_strided(matrix[0], (2, 3), matrix.strides).max(),
as_strided(matrix[0][3:6], (2, 3), matrix.strides).max(),
as_strided(matrix[2], (2, 3), matrix.strides).max(),
as_strided(matrix[2][3:6], (2, 3), matrix.strides).max()
]).reshape(2, 2)

Matrix to Vector with python/numpy

Numpy ravel works well if I need to create a vector by reading by rows or by columns. However, I would like to transform a matrix to a 1d array, by using a method that is often used in image processing. This is an example with initial matrix A and final result B:
A = np.array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
B = np.array([ 0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15])
Is there an existing function already that could help me with that? If not, can you give me some hints on how to solve this problem? PS. the matrix A is NxN.

I've been using numpy for several years, and I've never seen such a function.
Here's one way you could do it (not necessarily the most efficient):
In [47]: a
Out[47]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In [48]: np.concatenate([np.diagonal(a[::-1,:], k)[::(2*(k % 2)-1)] for k in range(1-a.shape[0], a.shape[0])])
Out[48]: array([ 0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15])
Breaking down the one-liner into separate steps:
a[::-1, :] reverses the rows:
In [59]: a[::-1, :]
Out[59]:
array([[12, 13, 14, 15],
[ 8, 9, 10, 11],
[ 4, 5, 6, 7],
[ 0, 1, 2, 3]])
(This could also be written a[::-1] or np.flipud(a).)
np.diagonal(a, k) extracts the kth diagonal, where k=0 is the main diagonal. So, for example,
In [65]: np.diagonal(a[::-1, :], -3)
Out[65]: array([0])
In [66]: np.diagonal(a[::-1, :], -2)
Out[66]: array([4, 1])
In [67]: np.diagonal(a[::-1, :], 0)
Out[67]: array([12, 9, 6, 3])
In [68]: np.diagonal(a[::-1, :], 2)
Out[68]: array([14, 11])
In the list comprehension, k gives the diagonal to be extracted. We want to reverse the elements in every other diagonal. The expression 2*(k % 2) - 1 gives the values 1, -1, 1, ... as k varies from -3 to 3. Indexing with [::1] leaves the order of the array being indexed unchanged, and indexing with [::-1] reverses the order of the array. So np.diagonal(a[::-1, :], k)[::(2*(k % 2)-1)] gives the kth diagonal, but with every other diagonal reversed:
In [71]: [np.diagonal(a[::-1,:], k)[::(2*(k % 2)-1)] for k in range(1-a.shape[0], a.shape[0])]
Out[71]:
[array([0]),
array([1, 4]),
array([8, 5, 2]),
array([ 3, 6, 9, 12]),
array([13, 10, 7]),
array([11, 14]),
array([15])]
np.concatenate() puts them all into a single array:
In [72]: np.concatenate([np.diagonal(a[::-1,:], k)[::(2*(k % 2)-1)] for k in range(1-a.shape[0], a.shape[0])])
Out[72]: array([ 0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15])
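Note that the sign trick 2*(k % 2) - 1 depends on the parity of the starting value 1 - N, so for odd N the traversal direction appears to flip (the scan starts 0, 3, 1 for a 3x3 matrix instead of 0, 1, 3). A sketch of a variant that counts diagonals from zero and therefore behaves the same for even and odd N is shown below. The function name zigzag is an illustrative choice, not an existing NumPy function.

```python
import numpy as np

def zigzag(a):
    """Flatten a square matrix in zigzag order, starting 0 -> right -> down-left."""
    n = a.shape[0]
    flipped = a[::-1]  # reversed rows, so anti-diagonals become diagonals
    parts = []
    for d, k in enumerate(range(1 - n, n)):
        diag = np.diagonal(flipped, k)  # runs from bottom-left to top-right
        # Reverse every odd-numbered anti-diagonal, counting from the top-left corner
        parts.append(diag[::-1] if d % 2 else diag)
    return np.concatenate(parts)

print(zigzag(np.arange(16).reshape(4, 4)))
print(zigzag(np.arange(9).reshape(3, 3)))
```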

I found discussion of zigzag scan for MATLAB, but not much for numpy. One project appears to use a hardcoded indexing array for 8x8 blocks
https://github.com/lot9s/lfv-compression/blob/master/scripts/our_mpeg/zigzag.py
ZIG = np.array([[0, 1, 5, 6, 14, 15, 27, 28],
[2, 4, 7, 13, 16, 26, 29, 42],
[3, 8, 12, 17, 25, 30, 41, 43],
[9, 11, 18, 24, 31, 40, 44,53],
[10, 19, 23, 32, 39, 45, 52,54],
[20, 22, 33, 38, 46, 51, 55,60],
[21, 34, 37, 47, 50, 56, 59,61],
[35, 36, 48, 49, 57, 58, 62,63]])
Apparently it's used in JPEG and MPEG compression.
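A lookup table like this stores, for each matrix position, its rank in the scan. As a rough sketch of how such a table could be applied, here is the same idea for the 4x4 example from the question. The table ZIG4 was written by hand for this illustration and does not come from the linked project.

```python
import numpy as np

# ZIG4[r, c] is the position of element (r, c) in the zigzag scan of a 4x4 block
ZIG4 = np.array([[0,  1,  5,  6],
                 [2,  4,  7, 12],
                 [3,  8, 11, 13],
                 [9, 10, 14, 15]])

A = np.arange(16).reshape(4, 4)

# Scatter every element to its scan position
B = np.empty(A.size, dtype=A.dtype)
B[ZIG4.ravel()] = A.ravel()
print(B)
```

For a fixed block size this avoids recomputing the diagonals each time, which is presumably why a hardcoded table is attractive for 8x8 blocks.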

Find all n-dimensional lines and diagonals with NumPy

Using NumPy, I would like to produce a list of all lines and diagonals of an n-dimensional array with lengths of k.
Take the case of the following three-dimensional array with lengths of three.
array([[[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8]],
[[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]],
[[18, 19, 20],
[21, 22, 23],
[24, 25, 26]]])
For this case, I would like to obtain all of the following types of sequences. For any given case, I would like to obtain all of the possible sequences of each type. Examples of desired sequences are given in parentheses below, for each case.
1D lines
x axis (0, 1, 2)
y axis (0, 3, 6)
z axis (0, 9, 18)
2D diagonals
x/y axes (0, 4, 8, 2, 4, 6)
x/z axes (0, 10, 20, 2, 10, 18)
y/z axes (0, 12, 24, 6, 12, 18)
3D diagonals
x/y/z axes (0, 13, 26, 2, 13, 24)
The solution should be generalized, so that it will generate all lines and diagonals for an array, regardless of the array's number of dimensions or length (which is constant across all dimensions).

This solution generalizes over n.
Let's rephrase this problem as "find the list of indices".
Let n = arr.ndim. We're looking for all of the 2d index arrays i of shape (n, k), used as
array[i[0], i[1], i[2], ..., i[n-1]]
Each of i[j] can be one of:
The same index repeated k times, ri[j] = [j, ..., j]
The forward sequence, fi = [0, 1, ..., k-1]
The backward sequence, bi = [k-1, ..., 1, 0]
with the requirement that each sequence of choices is of the form ^(ri)*(fi)(fi|bi|ri)*$ (using regex to summarize it). This is because:
there must be at least one fi so the "line" is not a point selected repeatedly
no bis come before fis, to avoid getting reversed lines
import numpy as np

def product_slices(n):
    # For axis i, yield an index that puts the choices on axis i and
    # inserts new (broadcastable) axes everywhere else
    for i in range(n):
        yield (
            np.index_exp[np.newaxis] * i +
            np.index_exp[:] +
            np.index_exp[np.newaxis] * (n - i - 1)
        )

def get_lines(n, k):
    """
    Returns:
        index (tuple): an object suitable for advanced indexing to get all possible lines
        mask (ndarray): a boolean mask to apply to the result of the above
    """
    fi = np.arange(k)
    bi = fi[::-1]
    ri = fi[:,None].repeat(k, axis=1)

    all_i = np.concatenate((fi[None], bi[None], ri), axis=0)

    # index which looks up every possible line, some of which are not valid
    index = tuple(all_i[s] for s in product_slices(n))

    # We incrementally allow lines that start with some number of `ri`s, and an `fi`
    #  [0]  here means we chose fi for that index
    #  [2:] here means we chose an ri for that index
    mask = np.zeros((all_i.shape[0],)*n, dtype=bool)  # np.bool was removed from NumPy
    sl = np.index_exp[0]
    for i in range(n):
        mask[sl] = True
        sl = np.index_exp[2:] + sl

    return index, mask
Applied to your example:
# construct your example array
n = 3
k = 3
data = np.arange(k**n).reshape((k,)*n)
# apply my index_creating function
index, mask = get_lines(n, k)
# apply the index to your array
lines = data[index][mask]
print(lines)
array([[ 0, 13, 26],
[ 2, 13, 24],
[ 0, 12, 24],
[ 1, 13, 25],
[ 2, 14, 26],
[ 6, 13, 20],
[ 8, 13, 18],
[ 6, 12, 18],
[ 7, 13, 19],
[ 8, 14, 20],
[ 0, 10, 20],
[ 2, 10, 18],
[ 0, 9, 18],
[ 1, 10, 19],
[ 2, 11, 20],
[ 3, 13, 23],
[ 5, 13, 21],
[ 3, 12, 21],
[ 4, 13, 22],
[ 5, 14, 23],
[ 6, 16, 26],
[ 8, 16, 24],
[ 6, 15, 24],
[ 7, 16, 25],
[ 8, 17, 26],
[ 0, 4, 8],
[ 2, 4, 6],
[ 0, 3, 6],
[ 1, 4, 7],
[ 2, 5, 8],
[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 13, 17],
[11, 13, 15],
[ 9, 12, 15],
[10, 13, 16],
[11, 14, 17],
[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17],
[18, 22, 26],
[20, 22, 24],
[18, 21, 24],
[19, 22, 25],
[20, 23, 26],
[18, 19, 20],
[21, 22, 23],
[24, 25, 26]])
Another good set of test data is np.moveaxis(np.indices((k,)*n), 0, -1), which gives an array where every value is its own index.
I've solved this problem before to implement a higher-dimensional tic-tac-toe.
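To make the suggested test data concrete, here is a small sketch for n = 2 and k = 3 (the values of n and k are chosen only to keep the output short). Each entry of coords holds its own coordinates, so any extracted line can be read off directly as a list of positions.

```python
import numpy as np

n, k = 2, 3
# np.indices returns shape (n, k, ..., k); moving axis 0 to the end
# gives shape (k, ..., k, n), where coords[i, j] == [i, j]
coords = np.moveaxis(np.indices((k,) * n), 0, -1)

print(coords.shape)
print(coords[1, 2])
# The main diagonal of the 2d case, as a list of coordinates
print(coords[np.arange(k), np.arange(k)])
```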

In [1]: x=np.arange(27).reshape(3,3,3)
Selecting individual rows is easy:
In [2]: x[0,0,:]
Out[2]: array([0, 1, 2])
In [3]: x[0,:,0]
Out[3]: array([0, 3, 6])
In [4]: x[:,0,0]
Out[4]: array([ 0, 9, 18])
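You could iterate over dimensions with an index list (converted to a tuple when indexing, since recent NumPy versions no longer accept a list containing slices as an index):
In [10]: idx=[slice(None),0,0]
In [11]: x[tuple(idx)]
Out[11]: array([ 0, 9, 18])
In [12]: idx[2]+=1
In [13]: x[tuple(idx)]
Out[13]: array([ 1, 10, 19])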
Look at the code for np.apply_along_axis to see how it implements this sort of iteration.
Reshape and split can also produce a list of rows. For some dimensions this might require a transpose:
In [20]: np.split(x.reshape(x.shape[0],-1),9,axis=1)
Out[20]:
[array([[ 0],
[ 9],
[18]]), array([[ 1],
[10],
[19]]), array([[ 2],
[11],
...
np.diag can get diagonals from 2d subarrays
In [21]: np.diag(x[0,:,:])
Out[21]: array([0, 4, 8])
In [22]: np.diag(x[1,:,:])
Out[22]: array([ 9, 13, 17])
In [23]: np.diag?
In [24]: np.diag(x[1,:,:],1)
Out[24]: array([10, 14])
In [25]: np.diag(x[1,:,:],-1)
Out[25]: array([12, 16])
And explore np.diagonal for direct application to the 3d array. It's also easy to index the array directly with range or arange, e.g. x[0,range(3),range(3)].
As far as I know there isn't a function to step through all these alternatives. Since dimensions of the returned arrays can differ, there's little point to producing such a function in compiled numpy code. So even if there was a function, it would step through the alternatives as I outlined.
==============
All the 1d lines
x.reshape(-1,3)
x.transpose(0,2,1).reshape(-1,3)
x.transpose(1,2,0).reshape(-1,3)
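The three expressions above can be folded into a loop over axes. A minimal sketch, assuming every dimension has the same length k (the helper name axis_lines is hypothetical):

```python
import numpy as np

def axis_lines(arr):
    """Return every axis-aligned line of an array whose dimensions all have length k."""
    k = arr.shape[0]
    # Move each axis to the end in turn, then flatten the remaining axes into rows
    return np.concatenate([np.moveaxis(arr, axis, -1).reshape(-1, k)
                           for axis in range(arr.ndim)])

x = np.arange(27).reshape(3, 3, 3)
lines = axis_lines(x)
print(lines.shape)
print(lines[0])
```

For the 3x3x3 example this yields 27 lines, 9 along each of the three axes.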
y/z diagonal and anti-diagonal
In [154]: i=np.arange(3)
In [155]: j=np.arange(2,-1,-1)
In [156]: np.concatenate((x[:,i,i],x[:,i,j]),axis=1)
Out[156]:
array([[ 0, 4, 8, 2, 4, 6],
[ 9, 13, 17, 11, 13, 15],
[18, 22, 26, 20, 22, 24]])

np.einsum can be used to build all these kind of expressions; for instance:
# 3d diagonals
print(np.einsum('iii->i', a))
# 2d diagonals
print(np.einsum('iij->ij', a))
print(np.einsum('iji->ij', a))
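As a sketch of how this might extend to the anti-diagonals the question asks for, one option is to reverse an axis before calling einsum. The array x below is the 3x3x3 example from the question; the variable names are illustrative.

```python
import numpy as np

x = np.arange(27).reshape(3, 3, 3)

# Main 3d diagonal: x[i, i, i]
main_diag = np.einsum('iii->i', x)
# A 3d anti-diagonal: reverse the last axis first, giving x[i, i, 2 - i]
anti_diag = np.einsum('iii->i', x[:, :, ::-1])
# 2d diagonals over the first two axes, one row per value of the third index
plane_diags = np.einsum('iij->ji', x)

print(main_diag)
print(anti_diag)
print(plane_diags)
```

Reversing different combinations of axes gives the remaining diagonals; main_diag and anti_diag here correspond to the (0, 13, 26) and (2, 13, 24) sequences from the question.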

Reshaping array into a square array Python

I have an array of numbers whose shape is 26*43264. I would like to reshape this into an array of shape 208*208 but in chunks of 26*26.
[[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10,11,12,13,14,15,16,17,18,19]]
becomes something like:
[[0, 1, 2, 3, 4],
[10,11,12,13,14],
[ 5, 6, 7, 8, 9],
[15,16,17,18,19]]

This kind of reshaping question has come up before. But rather than search, I'll quickly demonstrate a numpy approach.
Make your sample array:
In [473]: x=np.arange(20).reshape(2,10)
In [474]: x
Out[474]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])
Use reshape to split it into blocks of 5
In [475]: x.reshape(2,2,5)
Out[475]:
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9]],
[[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]]])
and use transpose to reorder dimensions, and in effect reorder those rows
In [476]: x.reshape(2,2,5).transpose(1,0,2)
Out[476]:
array([[[ 0, 1, 2, 3, 4],
[10, 11, 12, 13, 14]],
[[ 5, 6, 7, 8, 9],
[15, 16, 17, 18, 19]]])
and another reshape to consolidate the first two dimensions
In [477]: x.reshape(2,2,5).transpose(1,0,2).reshape(4,5)
Out[477]:
array([[ 0, 1, 2, 3, 4],
[10, 11, 12, 13, 14],
[ 5, 6, 7, 8, 9],
[15, 16, 17, 18, 19]])
If x is already a numpy array, these transpose and reshape operations are cheap (time wise). If x was really nested lists, then the other solution with list operations will be faster, since making a numpy array has overhead.
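The three steps can be wrapped in one function. This is a sketch under the assumption that the number of columns is an exact multiple of the block width; the name stack_column_blocks is made up for illustration.

```python
import numpy as np

def stack_column_blocks(x, width):
    """Cut a 2d array into column blocks of `width` and stack them vertically."""
    rows, cols = x.shape
    if cols % width:
        raise ValueError("number of columns must be a multiple of width")
    n_blocks = cols // width
    return (x.reshape(rows, n_blocks, width)  # split each row into blocks
             .transpose(1, 0, 2)              # put the block index first
             .reshape(n_blocks * rows, width))

x = np.arange(20).reshape(2, 10)
print(stack_column_blocks(x, 5))
```

For the array in the question, the same call with 26-column blocks would presumably be stack_column_blocks(x, 26), assuming the column count divides evenly.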

A little ugly, but here's a one-liner for the small example that you should be able to modify for the full size one:
In [29]: from itertools import chain
In [30]: np.array(list(chain(*[np.arange(20).reshape(4,5)[i::2] for i in range(2)])))
Out[30]:
array([[ 0, 1, 2, 3, 4],
[10, 11, 12, 13, 14],
[ 5, 6, 7, 8, 9],
[15, 16, 17, 18, 19]])
EDIT: Here's a more generalized version in a function. Uglier code, but the function just takes an array and a number of segments you'd like to end up with.
In [57]: def break_arr(arr, chunks):
   ....:     to_take = arr.shape[1] // chunks  # columns per chunk (integer division)
   ....:     return np.array(list(chain(*[arr.take(range(x*to_take, x*to_take+to_take), axis=1) for x in range(chunks)])))
   ....:
In [58]: arr = np.arange(40).reshape(4,10)
In [59]: break_arr(arr, 5)
Out[59]:
array([[ 0, 1],
[10, 11],
[20, 21],
[30, 31],
[ 2, 3],
[12, 13],
[22, 23],
[32, 33],
[ 4, 5],
[14, 15],
[24, 25],
[34, 35],
[ 6, 7],
[16, 17],
[26, 27],
[36, 37],
[ 8, 9],
[18, 19],
[28, 29],
[38, 39]])
In [60]: break_arr(arr, 2)
Out[60]:
array([[ 0, 1, 2, 3, 4],
[10, 11, 12, 13, 14],
[20, 21, 22, 23, 24],
[30, 31, 32, 33, 34],
[ 5, 6, 7, 8, 9],
[15, 16, 17, 18, 19],
[25, 26, 27, 28, 29],
[35, 36, 37, 38, 39]])
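For comparison, the same result can probably be had from NumPy's own splitting helpers without itertools. A short sketch using np.hsplit and np.vstack, with the arr from the session above:

```python
import numpy as np

arr = np.arange(40).reshape(4, 10)

# Split the columns into 2 equal chunks, then stack the chunks vertically
stacked = np.vstack(np.hsplit(arr, 2))
print(stacked)

# With 5 chunks, each piece is 2 columns wide
print(np.vstack(np.hsplit(arr, 5)).shape)
```

np.hsplit raises an error when the columns cannot be divided equally, which makes a mismatch in sizes visible early.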
