Delete rows in ndarray where sum of multiple indexes is 0 - python

So I have a very large two-dimensional numpy array such as:
array([[ 2, 4, 0, 0, 0, 5, 9, 0],
[ 2, 3, 0, 1, 0, 3, 1, 1],
[ 1, 5, 4, 3, 2, 7, 8, 3],
[ 0, 7, 0, 0, 0, 6, 4, 4],
...,
[ 6, 5, 6, 0, 0, 1, 9, 5]])
I would like to quickly remove each row of the array where np.sum(row[2:5]) == 0
The only way I can think to do this is with for loops, but that takes very long when there are millions of rows. Additionally, this needs to be constrained to Python 2.7

Boolean expressions can be used as an index. You can use them to mask the array.
import numpy
inputarray = numpy.array([[ 2, 4, 0, 0, 0, 5, 9, 0],
[ 2, 3, 0, 1, 0, 3, 1, 1],
[ 1, 5, 4, 3, 2, 7, 8, 3],
[ 0, 7, 0, 0, 0, 6, 4, 4],
...,
[ 6, 5, 6, 0, 0, 1, 9, 5]])
mask = numpy.sum(inputarray[:,2:5], axis=1) != 0
result = inputarray[mask,:]
What this is doing:
inputarray[:, 2:5] selects all the columns you want to sum over
axis=1 means the sum runs across the columns, giving one total per row
We want to keep the rows where the sum is not zero
The mask is used as a row index and selects the rows where the boolean expression is True
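The same steps can be wrapped in a small helper. This is only a sketch: the function name drop_zero_sum_rows and its start/stop parameters are made up for illustration, and the sample array is the one from the question without the elided rows.

```python
import numpy as np

def drop_zero_sum_rows(arr, start, stop):
    """Return the rows of arr whose columns start:stop do not sum to zero."""
    # One boolean per row: True where the partial row sum is nonzero.
    mask = arr[:, start:stop].sum(axis=1) != 0
    # Boolean indexing along axis 0 keeps only the True rows (returns a copy).
    return arr[mask]

sample = np.array([[2, 4, 0, 0, 0, 5, 9, 0],
                   [2, 3, 0, 1, 0, 3, 1, 1],
                   [1, 5, 4, 3, 2, 7, 8, 3],
                   [0, 7, 0, 0, 0, 6, 4, 4],
                   [6, 5, 6, 0, 0, 1, 9, 5]])

kept = drop_zero_sum_rows(sample, 2, 5)
print(kept.shape)  # (3, 8)
```

No Python-level loop is involved, so this should scale to millions of rows, and it uses nothing that is unavailable under Python 2.7.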

Another solution would be to use numpy.apply_along_axis to calculate the sums and cast it as a bool, and use that for your index:
my_arr = np.array([[ 2, 4, 0, 0, 0, 5, 9, 0],
[ 2, 3, 0, 1, 0, 3, 1, 1],
[ 1, 5, 4, 3, 2, 7, 8, 3],
[ 0, 7, 0, 0, 0, 6, 4, 4],])
my_arr[np.apply_along_axis(lambda x: bool(sum(x[2:5])), 1, my_arr)]
array([[2, 3, 0, 1, 0, 3, 1, 1],
[1, 5, 4, 3, 2, 7, 8, 3]])
We just cast the sum to a bool, since any number that's not 0 is going to be True.
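Note that apply_along_axis calls the lambda once per row from Python, so on very large arrays it is likely to be much slower than a vectorized sum. A minimal sketch comparing the two masks on the same four rows (the variable names loop_mask and fast_mask are illustrative):

```python
import numpy as np

my_arr = np.array([[2, 4, 0, 0, 0, 5, 9, 0],
                   [2, 3, 0, 1, 0, 3, 1, 1],
                   [1, 5, 4, 3, 2, 7, 8, 3],
                   [0, 7, 0, 0, 0, 6, 4, 4]])

# Row-by-row version: the lambda runs once per row in Python.
loop_mask = np.apply_along_axis(lambda x: bool(sum(x[2:5])), 1, my_arr)

# Vectorized version: one sum over the slice, then the same cast to bool.
fast_mask = my_arr[:, 2:5].sum(axis=1).astype(bool)

same = np.array_equal(loop_mask, fast_mask)
print(same)  # True
```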

>>> a
array([[2, 4, 0, 0, 0, 5, 9, 0],
[2, 3, 0, 1, 0, 3, 1, 1],
[1, 5, 4, 3, 2, 7, 8, 3],
[0, 7, 0, 0, 0, 6, 4, 4],
[6, 5, 6, 0, 0, 1, 9, 5]])
You are interested in columns 2 through 4 (the slice 2:5 stops before index 5)
>>> a[:,2:5]
array([[0, 0, 0],
[0, 1, 0],
[4, 3, 2],
[0, 0, 0],
[6, 0, 0]])
>>> b = a[:,2:5]
You want to find the sum of those columns in each row
>>> sum_ = b.sum(1)
>>> sum_
array([0, 1, 9, 0, 6])
These are the rows that meet your criteria
>>> sum_ != 0
array([False, True, True, False, True], dtype=bool)
>>> keep = sum_ != 0
Use boolean indexing to select those rows
>>> a[keep, :]
array([[2, 3, 0, 1, 0, 3, 1, 1],
[1, 5, 4, 3, 2, 7, 8, 3],
[6, 5, 6, 0, 0, 1, 9, 5]])
>>>

Related

Python numpy: Add elements of a numpy array of arrays to elements of another array of arrays initialized to at the specified positions

Suppose we have a numpy array of numpy arrays of zeros as
arr1 = np.zeros((len(Train), L))
where Train is a (dataset) numpy array of arrays of integers of fixed length.
We also have another 1d numpy array, positions, of length len(Train).
Now we wish to add elements of Train to arr1 at the positions specified by positions.
One way is to use a for loop on the Train array as:
k=len(Train[0])
for i in range(len(Train)):
    arr1[i, int(positions[i]):int(positions[i] + k)] = Train[i, 0:k]
However, going over the entire Train set using the explicit for loop is slow and I would like to optimize it.
Here is one way by generating all the indexes you want to assign to. Setup:
import numpy as np
n = 12 # Number of training samples
l = 8 # Number of columns in the output array
k = 4 # Number of columns in the training samples
arr = np.zeros((n, l), dtype=int)
train = np.random.randint(10, size=(n, k))
positions = np.random.randint(l - k, size=n)
Random example data:
>>> train
array([[3, 4, 3, 2],
[3, 6, 4, 1],
[0, 7, 9, 6],
[4, 0, 4, 8],
[2, 2, 6, 2],
[4, 5, 1, 7],
[5, 4, 4, 4],
[0, 8, 5, 3],
[2, 9, 3, 3],
[3, 3, 7, 9],
[8, 9, 4, 8],
[8, 7, 6, 4]])
>>> positions
array([3, 2, 3, 2, 0, 1, 2, 2, 3, 2, 1, 1])
Advanced indexing with broadcasting trickery:
rows = np.arange(n)[:, None] # Shape (n, 1)
cols = np.arange(k) + positions[:, None] # Shape (n, k)
arr[rows, cols] = train
output:
>>> arr
array([[0, 0, 0, 3, 4, 3, 2, 0],
[0, 0, 3, 6, 4, 1, 0, 0],
[0, 0, 0, 0, 7, 9, 6, 0],
[0, 0, 4, 0, 4, 8, 0, 0],
[2, 2, 6, 2, 0, 0, 0, 0],
[0, 4, 5, 1, 7, 0, 0, 0],
[0, 0, 5, 4, 4, 4, 0, 0],
[0, 0, 0, 8, 5, 3, 0, 0],
[0, 0, 0, 2, 9, 3, 3, 0],
[0, 0, 3, 3, 7, 9, 0, 0],
[0, 8, 9, 4, 8, 0, 0, 0],
[0, 8, 7, 6, 4, 0, 0, 0]])
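As a sanity check, the broadcast assignment can be compared against the explicit loop from the question. This is a sketch with assumptions: the fixed seed and the name expected are illustrative, and positions is drawn from 0 to l - k inclusive so that every slice fits inside a row.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the check is repeatable
n, l, k = 12, 8, 4
train = rng.randint(10, size=(n, k))
positions = rng.randint(l - k + 1, size=n)  # valid start offsets: 0 .. l - k

# Reference result: the explicit loop from the question.
expected = np.zeros((n, l), dtype=int)
for i in range(n):
    expected[i, positions[i]:positions[i] + k] = train[i]

# Vectorized result: rows has shape (n, 1) and cols has shape (n, k),
# so they broadcast to one (row, column) index pair per element of train.
arr = np.zeros((n, l), dtype=int)
rows = np.arange(n)[:, None]
cols = np.arange(k) + positions[:, None]
arr[rows, cols] = train

matches = np.array_equal(arr, expected)
print(matches)  # True
```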

Cumulative count of duplicate values by certain axis in Numpy array

Let's say I have this numpy array:
array([[4, 5, 6, 8, 5, 6],
[5, 1, 1, 9, 0, 5],
[7, 0, 5, 8, 0, 5],
[9, 2, 3, 8, 2, 3],
[1, 2, 2, 9, 2, 8]])
Going row by row, I would like to see, for each column, how many times the value has already appeared earlier in that column. So for this array, the result would be:
array([[0, 0, 0, 0, 0, 0], # (*0)
[0, 0, 0, 0, 0, 0], # (*1)
[0, 0, 0, 1, 1, 1], # (*2)
[0, 0, 0, 2, 0, 0], # (*3)
[0, 1, 0, 1, 1, 0]]) # (*4)
(*0): first time each value appears
(*1): all values are different from the previous one (in the column)
(*2): In the last 3 columns, a 1 appears because each of those values has already appeared once in its column.
(*3): For the 4th column, a 2 appears because it's the 3rd time that a 8 appears.
(*4): In the 4th column, a 1 appears because it's the 2nd time that a 9 appears in that column. Similarly, for the second and second to last column.
Any idea how to perform this?
Thanks!
There may be a faster way using numpy ufuncs, but here is a solution using standard Python:
from collections import defaultdict
import numpy as np
a = np.array([[4, 5, 6, 8, 5, 6],
[5, 1, 1, 9, 0, 5],
[7, 0, 5, 8, 0, 5],
[9, 2, 3, 8, 2, 3],
[1, 2, 2, 9, 2, 8]])
# define function
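def get_count(array):
    count = []
    # Iterating over array.T walks the columns of the original array
    for column in array.T:
        occurrences = defaultdict(int)
        column_count = []
        for n in column:
            occurrences[n] += 1
            # Number of earlier appearances of n in this column
            column_count.append(occurrences[n] - 1)
        count.append(column_count)
    return np.array(count).T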
Output:
>>> get_count(a)
array([[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 2, 0, 0],
[0, 1, 0, 1, 1, 0]])
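For arrays with a modest number of rows, one possible loop-free variant compares every row with every earlier row via broadcasting. This is a sketch, not a tuned solution: the name get_count_vectorized is made up, and the intermediate boolean array has shape (rows, rows, columns), so memory grows quadratically with the row count.

```python
import numpy as np

def get_count_vectorized(a):
    n = a.shape[0]
    # same[i, j, c] is True when a[i, c] == a[j, c]
    same = a[:, None, :] == a[None, :, :]
    # earlier[i, j] is True when row j comes strictly before row i
    earlier = np.tri(n, k=-1, dtype=bool)
    # For each (i, c), count the earlier rows holding the same value
    return (same & earlier[:, :, None]).sum(axis=1)

a = np.array([[4, 5, 6, 8, 5, 6],
              [5, 1, 1, 9, 0, 5],
              [7, 0, 5, 8, 0, 5],
              [9, 2, 3, 8, 2, 3],
              [1, 2, 2, 9, 2, 8]])

result = get_count_vectorized(a)
```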

Python numpy array -- close smallest regions

I have a 2D boolean numpy array that represents an image, on which I call skimage.measure.label to label each segmented region, giving me a 2D array of int [0,500]; each value in this array represents the region label for that pixel. I would like to now remove the smallest regions. For example, if my input array is shape (n, n), I would like all labeled regions of < m pixels to be subsumed into the larger surrounding regions. For example if n=10 and m=5, my input could be,
0, 0, 0, 0, 0, 0, 0, 1, 1, 1
0, 0, 0, 0, 0, 0, 0, 1, 1, 1
0, 0, 7, 8, 0, 0, 0, 1, 1, 1
0, 0, 0, 0, 0, 0, 0, 1, 1, 1
0, 0, 0, 0, 0, 2, 2, 2, 1, 1
4, 4, 4, 4, 2, 2, 2, 2, 1, 1
4, 6, 6, 4, 2, 2, 2, 3, 3, 3
4, 6, 6, 4, 5, 5, 5, 3, 3, 5
4, 4, 4, 4, 5, 5, 5, 5, 5, 5
4, 4, 4, 4, 5, 5, 5, 5, 5, 5
and the output is then,
0, 0, 0, 0, 0, 0, 0, 1, 1, 1
0, 0, 0, 0, 0, 0, 0, 1, 1, 1
0, 0, 0, 0, 0, 0, 0, 1, 1, 1 # 7 and 8 are replaced by 0
0, 0, 0, 0, 0, 0, 0, 1, 1, 1
0, 0, 0, 0, 0, 2, 2, 2, 1, 1
4, 4, 4, 4, 2, 2, 2, 2, 1, 1
4, 4, 4, 4, 2, 2, 2, 3, 3, 3 # 6 is gone, but 3 remains
4, 4, 4, 4, 5, 5, 5, 3, 3, 5
4, 4, 4, 4, 5, 5, 5, 5, 5, 5
4, 4, 4, 4, 5, 5, 5, 5, 5, 5
I've looked into skimage morphology operations, including binary closing, but none seem to work well for my use case. Any suggestions?
You can do this by performing a binary dilation on the boolean region corresponding to each label. By doing this you will find the number of neighbours for each region. Using this you can then replace values as needed.
For example:
import numpy as np
import scipy.ndimage
m = 5
arr = [[0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 7, 8, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 2, 2, 2, 1, 1],
[4, 4, 4, 4, 2, 2, 2, 2, 1, 1],
[4, 6, 6, 4, 2, 2, 2, 3, 3, 3],
[4, 6, 6, 4, 5, 5, 5, 3, 3, 5],
[4, 4, 4, 4, 5, 5, 5, 5, 5, 5],
[4, 4, 4, 4, 5, 5, 5, 5, 5, 5]]
arr = np.array(arr)
nval = np.max(arr) + 1
# Compute number of occurrences of each number
counts, _ = np.histogram(arr, bins=range(nval + 1))
# Compute the set of neighbours for each number via binary dilation
c = np.array([scipy.ndimage.binary_dilation(arr == i)
              for i in range(nval)])
# Loop over the set of arrays with bad count and update them to the most common
# neighbour
for i in filter(lambda i: counts[i] < m, range(nval)):
    arr[arr == i] = np.argmax(np.sum(c[:, arr == i], axis=1))
Which gives the expected result:
>>> arr.tolist()
[[0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 2, 2, 2, 1, 1],
[4, 4, 4, 4, 2, 2, 2, 2, 1, 1],
[4, 4, 4, 4, 2, 2, 2, 3, 3, 3],
[4, 4, 4, 4, 5, 5, 5, 3, 3, 5],
[4, 4, 4, 4, 5, 5, 5, 5, 5, 5],
[4, 4, 4, 4, 5, 5, 5, 5, 5, 5]]
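The same idea can be packaged as a function that leaves its input untouched. This is a sketch under a few assumptions: merge_small_regions is a hypothetical name, np.bincount stands in for the histogram call, and a region's own dilation is excluded from the vote so the result does not depend on how argmax breaks ties. That exclusion is a small variation on the code above, not part of it.

```python
import numpy as np
from scipy import ndimage

def merge_small_regions(labels, m):
    """Relabel every region smaller than m pixels to its most common neighbour."""
    labels = np.asarray(labels)
    out = labels.copy()
    counts = np.bincount(labels.ravel())  # pixels per label
    # dilated[j] marks label j grown by one pixel (default cross-shaped structure)
    dilated = np.array([ndimage.binary_dilation(labels == j)
                        for j in range(counts.size)])
    for i in np.flatnonzero((counts > 0) & (counts < m)):
        region = labels == i
        # votes[j]: how many pixels of region i lie inside the dilation of label j
        votes = dilated[:, region].sum(axis=1)
        votes[i] = 0  # a region must not vote for itself
        out[region] = np.argmax(votes)
    return out

grid = [[0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
        [0, 0, 7, 8, 0, 0, 0, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
        [0, 0, 0, 0, 0, 2, 2, 2, 1, 1],
        [4, 4, 4, 4, 2, 2, 2, 2, 1, 1],
        [4, 6, 6, 4, 2, 2, 2, 3, 3, 3],
        [4, 6, 6, 4, 5, 5, 5, 3, 3, 5],
        [4, 4, 4, 4, 5, 5, 5, 5, 5, 5],
        [4, 4, 4, 4, 5, 5, 5, 5, 5, 5]]

merged = merge_small_regions(grid, 5)
```

Neighbours are taken from the original labelling, so a small region could in principle be merged into another small region; repeating the call until nothing changes is one way to handle that case.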

Append element to the end of each row in numpy array

If I have a numpy array like:
x= [[3, 3], [2, 2]]
I want to add an element -1 to the end of each row, like this:
x= [[3, 3, -1], [2, 2, -1]]
Is there a simple way to do that?
A simple way would be with np.insert -
np.insert(x,x.shape[1],-1,axis=1)
We can also use np.column_stack -
np.column_stack((x,[-1]*x.shape[0]))
Sample run -
In [161]: x
Out[161]:
array([[0, 8, 7, 0, 1],
[0, 1, 8, 6, 8],
[3, 4, 7, 0, 2]])
In [162]: np.insert(x,x.shape[1],-1,axis=1)
Out[162]:
array([[ 0, 8, 7, 0, 1, -1],
[ 0, 1, 8, 6, 8, -1],
[ 3, 4, 7, 0, 2, -1]])
In [163]: np.column_stack((x,[-1]*x.shape[0]))
Out[163]:
array([[ 0, 8, 7, 0, 1, -1],
[ 0, 1, 8, 6, 8, -1],
[ 3, 4, 7, 0, 2, -1]])
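One caveat: x as written in the question is a plain nested list, which has no .shape attribute, so it needs converting to an array first. A minimal sketch that does the conversion and builds the extra column with np.full and np.hstack as a third option (the name filler is illustrative):

```python
import numpy as np

x = np.asarray([[3, 3], [2, 2]])  # the question's nested list, converted to an ndarray

# A column of -1 with one entry per row, in the same dtype as x
filler = np.full((x.shape[0], 1), -1, dtype=x.dtype)
result = np.hstack((x, filler))

# The np.insert call from above produces the same array
via_insert = np.insert(x, x.shape[1], -1, axis=1)
print(np.array_equal(result, via_insert))  # True
```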

How to check that a matrix contains a zero column?

I have a large matrix and I'd like to check whether it has a column of all zeros somewhere in it. How can I do that in numpy?
Here's one way:
In [19]: a
Out[19]:
array([[9, 4, 0, 0, 7, 2, 0, 4, 0, 1, 2],
[0, 2, 0, 0, 0, 7, 6, 0, 6, 2, 0],
[6, 8, 0, 4, 0, 6, 2, 0, 8, 0, 3],
[5, 4, 0, 0, 0, 0, 0, 0, 0, 3, 8]])
In [20]: (~a.any(axis=0)).any()
Out[20]: True
If you later decide that you need the column index:
In [26]: numpy.where(~a.any(axis=0))[0]
Out[26]: array([2])
Alternatively, create an equals-0 mask (mat == 0), run all on it along axis 0, and then any on the result:
(mat == 0).all(axis=0).any()
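Both checks can be put side by side on the example matrix. A small sketch, where zero_columns is a hypothetical helper name that returns the indices of the all-zero columns:

```python
import numpy as np

def zero_columns(mat):
    """Return the indices of the columns of mat that contain only zeros."""
    return np.flatnonzero((mat == 0).all(axis=0))

a = np.array([[9, 4, 0, 0, 7, 2, 0, 4, 0, 1, 2],
              [0, 2, 0, 0, 0, 7, 6, 0, 6, 2, 0],
              [6, 8, 0, 4, 0, 6, 2, 0, 8, 0, 3],
              [5, 4, 0, 0, 0, 0, 0, 0, 0, 3, 8]])

idx = zero_columns(a)
has_zero_column = idx.size > 0

# The two formulations from the answers agree column by column
agree = np.array_equal((a == 0).all(axis=0), ~a.any(axis=0))
print(idx, has_zero_column, agree)  # [2] True True
```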
