We have 3D segmentation masks where every class has its own label / ID.
For every class we would like to fill holes in the segmentation.
For example, the following matrix:
[
[
[ 1, 1, 1, 2, 2, 2 ],
[ 1, 1, 1, 2, 2, 2 ],
[ 1, 1, 1, 2, 2, 2 ],
[ 0, 3, 0, 0, 4, 0 ],
[ 3, 3, 3, 4, 0, 4 ],
[ 0, 3, 0, 0, 4, 0 ],
],
[
[ 1, 1, 1, 2, 2, 2 ],
[ 1, 0, 1, 2, 0, 0 ],
[ 1, 1, 1, 2, 2, 2 ],
[ 0, 3, 0, 0, 4, 0 ],
[ 3, 0, 3, 4, 0, 4 ],
[ 0, 3, 0, 0, 4, 0 ],
],
[
[ 1, 1, 1, 2, 2, 2 ],
[ 1, 1, 1, 2, 2, 2 ],
[ 1, 1, 1, 2, 2, 2 ],
[ 0, 3, 0, 0, 4, 0 ],
[ 3, 3, 3, 4, 4, 4 ],
[ 0, 3, 0, 0, 4, 0 ],
],
]
Should result in
[
[
[ 1, 1, 1, 2, 2, 2 ],
[ 1, 1, 1, 2, 2, 2 ],
[ 1, 1, 1, 2, 2, 2 ],
[ 0, 3, 0, 0, 4, 0 ],
[ 3, 3, 3, 4, 0, 4 ],
[ 0, 3, 0, 0, 4, 0 ],
],
[
[ 1, 1, 1, 2, 2, 2 ],
[ 1, 1, 1, 2, 0, 0 ],
[ 1, 1, 1, 2, 2, 2 ],
[ 0, 3, 0, 0, 4, 0 ],
[ 3, 3, 3, 4, 0, 4 ],
[ 0, 3, 0, 0, 4, 0 ],
],
[
[ 1, 1, 1, 2, 2, 2 ],
[ 1, 1, 1, 2, 2, 2 ],
[ 1, 1, 1, 2, 2, 2 ],
[ 0, 3, 0, 0, 4, 0 ],
[ 3, 3, 3, 4, 4, 4 ],
[ 0, 3, 0, 0, 4, 0 ],
],
]
The only filled holes are the 1 and 3 in the middle slice.
The 2 shape is open to the side and the 4 is open to the back.
The 0 between the classes should stay untouched.
I implemented 7 versions using the existing scipy.ndimage.morphology.binary_fill_holes function (or its implementation) and numpy. Here are the two best versions so far:
import numpy as np
from scipy.ndimage import binary_fill_holes, label, generate_binary_structure, binary_dilation

def fill_holes6(img: np.ndarray, applied_labels: np.ndarray) -> np.ndarray:
    output = np.zeros_like(img)
    for i in applied_labels:
        output[binary_fill_holes(img == i)] = i
    return output

def fill_holes7(img: np.ndarray, applied_labels: np.ndarray) -> np.ndarray:
    output = np.zeros(img.shape, dtype=int)
    for i in applied_labels:
        tmp = np.zeros(img.shape, dtype=bool)
        # Iterative dilation from the border, restricted to cells not equal to i:
        # everything never reached is enclosed by label i (the label itself or a hole).
        binary_dilation(tmp, structure=None, iterations=-1, mask=img != i,
                        origin=0, border_value=1, output=tmp)
        output[np.logical_not(tmp)] = i
    return output
# EDIT: Added the following method:
def fill_holes8(img: np.ndarray, applied_labels: np.ndarray) -> np.ndarray:
    connectivity = 1
    footprint = generate_binary_structure(img.ndim, connectivity)
    background_mask = img == 0
    components, num_components = label(background_mask, structure=footprint)
    filled_holes = np.zeros_like(img)
    for component_label in range(1, num_components + 1):
        component_mask = components == component_label
        # Pad with -1 so components touching the volume border see a -1
        # neighbour and are never treated as holes.
        component_neighborhood = np.pad(img, 1, constant_values=-1)[
            binary_dilation(np.pad(component_mask, 1), structure=footprint)]
        neighbor_labels = np.unique(component_neighborhood)
        # A hole is a background component surrounded by exactly one label.
        if len(neighbor_labels) == 2 and -1 not in neighbor_labels:
            neighbor_label = neighbor_labels[1]
            filled_holes[component_mask] = neighbor_label
    return img + filled_holes
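As a quick sanity check, here is a hedged smoke test on a tiny made-up volume: an enclosed background voxel inside label 1 should be filled, while background touching the border must stay untouched:

img = np.ones((3, 3, 3), dtype=int)
img[1, 1, 1] = 0  # enclosed hole inside label 1 -> should become 1
img[0, 0, 0] = 0  # background touching the border -> should stay 0
out = fill_holes8(img, np.unique(img)[1:])
print(out[1, 1, 1], out[0, 0, 0])  # expected: 1 0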
I measured the performance in the following way (matching my real-world data distribution):

import time
import pandas as pd

def measure(funs, t):
    res = []
    for _ in range(t):
        ra = np.random.randint(10, 40)
        sh = np.random.randint(200, 400, 3)
        img = np.random.randint(0, ra, sh)
        applied_labels = np.unique(img)[1:]
        fun_res = []
        for fun in funs:
            start = time.time()
            fun(img, applied_labels)
            end = time.time()
            fun_res.append(end - start)
        res.append(fun_res)
    return np.min(res, axis=0), np.max(res, axis=0), np.mean(res, axis=0), np.std(res, axis=0)

print(measure([fill_holes6, fill_holes7], t=10))
For my first implementations I got the following execution times (t=100):
        fill_holes1  fill_holes2  fill_holes3
min     6.4s         6.9s         6.2s
max     83.7s        96.0s        80.4s
mean    32.9s        37.3s        31.6s
std     17.3s        20.1s        16.5s
This is very slow.
The last implementation fill_holes7 is only 1.27 times faster than fill_holes3.
Is there a more performant way of doing this?
I first opened a feature request on the scipy project but was asked to go to Stack Overflow instead: https://github.com/scipy/scipy/issues/14504
EDIT:
I also opened a feature request on the MONAI project. See #2678
For this I opened a pull request with the iterative erosion solution (fill_holes7).
You can find the documentation here: monai.transforms.FillHoles
During this I also implemented a connected component labeling (CCL) based version.
See the implementation in MONAI here.
I added fill_holes8 above, which is basically that implementation.
The MONAI maintainers are happy to take any pull request improving the performance of this method. Feel free to go there and open an issue and a pull request.
binary_fill_holes is not very efficiently implemented: it does not seem to make use of SIMD instructions and is not parallelized. It is also based on a fairly expensive algorithm (iterative erosion). Since this function is run once per label, your final implementation is very computationally intensive. One way to fix the performance issue is to redesign your algorithm.
A first step is to keep the iteration over the labels and find a more efficient way to fill holes. One efficient solution is to run a flood-fill algorithm from each cell of the border not yet filled and then look for the unfilled cells. The remaining cells are either holes or cells already set with the current label. Such an algorithm should be quite fast, but it is not easy to implement efficiently in Python. There are some implementations of flood fill in Python (e.g. in skimage.morphology), but the cost of calling the function from Python for each border cell would be too high.
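To illustrate the flood-fill idea, here is a minimal sketch of my own (not the asker's code) using skimage.morphology.flood_fill, which I assume to be available (skimage >= 0.16). Padding the mask lets a single seed reach the whole outside, which avoids the per-border-cell call overhead mentioned above:

import numpy as np
from skimage.morphology import flood_fill  # assumption: skimage >= 0.16

def fill_holes_flood(img, applied_labels):
    # Sketch only: flood the outside of a zero-padded copy from one corner;
    # any background the flood never reaches is enclosed, i.e. a hole.
    output = img.copy()
    for i in applied_labels:
        mask = np.pad((img == i).astype(np.uint8), 1)  # 0 = background, 1 = label i
        flooded = flood_fill(mask, (0,) * img.ndim, 2, connectivity=1)
        holes = flooded[tuple(slice(1, -1) for _ in range(img.ndim))] == 0
        output[holes] = i
    return output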
An alternative solution is to use a labeling algorithm to find all the regions of the array that are connected to each other. This can easily be done with skimage.measure.label. Once labeled, the regions touching the border can be marked as not being holes; the remaining ones are either holes or regions already set with the right label. This solution is more expensive, especially when the number of labels is big (which seems quite rare based on your example, and as long as each label is computed separately). Here is an implementation:
import numpy as np
from skimage.measure import label

def getBorderLabels(img):
    # Detection
    lab0 = np.unique(img[:,:,0])
    lab1 = np.unique(img[:,:,-1])
    lab2 = np.unique(img[:,0,:])
    lab3 = np.unique(img[:,-1,:])
    lab4 = np.unique(img[0,:,:])
    lab5 = np.unique(img[-1,:,:])
    # Reduction
    lab0 = np.union1d(lab0, lab1)
    lab2 = np.union1d(lab2, lab3)
    lab4 = np.union1d(lab4, lab5)
    return np.union1d(np.union1d(lab0, lab2), lab4)

def getHoleLabels(borderLabels, labelCount):
    return np.setdiff1d(np.arange(1, labelCount+1, dtype=int), borderLabels, assume_unique=True)

def fill_holes8(img: np.ndarray, applied_labels: np.ndarray) -> np.ndarray:
    output = img.copy()
    for i in applied_labels:
        # background=True: label the connected regions of the non-i cells.
        labelized, labelCount = label(img==i, background=True, return_num=True, connectivity=1)
        holeLabels = getHoleLabels(getBorderLabels(labelized), labelCount)
        if len(holeLabels) > 0:
            output[np.isin(labelized, holeLabels)] = i
    return output
This implementation is about 3 times faster on my machine.
Note that it is possible to parallelize the algorithm (e.g. using multiple processes) by working on multiple labels at the same time. However, one should take care not to use too much memory and to write into output in the correct order (matching the sequential algorithm).
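Here is a hedged sketch of that per-label parallelization with multiprocessing (names are illustrative; each worker receives a pickled copy of img, so memory grows with the number of workers, and ex.map returns results in submission order, which keeps the writes ordered like the sequential loop):

from concurrent.futures import ProcessPoolExecutor
from functools import partial

def fill_one_label(img, i):
    labelized, labelCount = label(img == i, background=True,
                                  return_num=True, connectivity=1)
    holeLabels = getHoleLabels(getBorderLabels(labelized), labelCount)
    if len(holeLabels) == 0:
        return i, None
    return i, np.isin(labelized, holeLabels)

def fill_holes_parallel(img, applied_labels, workers=4):
    output = img.copy()
    with ProcessPoolExecutor(max_workers=workers) as ex:
        # Results come back in submission order, matching the sequential writes.
        for i, holes in ex.map(partial(fill_one_label, img), applied_labels):
            if holes is not None:
                output[holes] = i
    return output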
The biggest source of slow-down comes from the separate computation of each label. One can tune the flood-fill algorithm to write a custom well-optimized implementation fitting your needs, although this appears to be pretty hard to do. Alternatively, one can tune the label-based implementation to do the same. The second approach is simpler, but not easy either. Many questions arise in complex cases: what should happen when cells with a given label L1 form a boundary containing a hole that itself contains other cells with a given label L2 forming a boundary containing another hole? What if the boundaries partially overlap each other? Are such cases possible? Should they be investigated, and if yes, what would be the set of accepted outputs?
As long as the labeled boundaries do not form tricky cases, there is a quite efficient algorithm to track and fill holes with the right labels. I am not sure it always works, but here is the idea:
Use a labeling algorithm to find all connected regions
Build a set containing all the region labels
Remove the labels associated with border regions from the set
Remove the labels associated with cells already labelled (i.e. non-zero cells)
The remaining labels are either holes or fake-holes (or tricky cases, assumed not to be present). Fake-holes are unlabelled cells surrounded by labelled cells with multiple different labels (like the 0 cells in the middle of your example).
Check the labels of the cells on the boundary of each remaining region. If a region is surrounded only by cells with the same label L3, then it is a hole that must be filled with L3-labelled cells. Otherwise, it is a fake-hole (or a tricky case).
The resulting algorithm should be much faster than the reference implementation and the previous one.
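A minimal sketch of these steps with scipy.ndimage (hedged: it assumes the tricky nested cases above do not occur, and fill_holes_ccl is an illustrative name):

import numpy as np
from scipy import ndimage

def fill_holes_ccl(img):
    footprint = ndimage.generate_binary_structure(img.ndim, 1)
    # Labelling only the background already excludes non-zero cells,
    # so they can never become hole candidates.
    comps, n = ndimage.label(img == 0, structure=footprint)
    candidates = set(range(1, n + 1))
    # Drop the components touching the border of the volume.
    border = np.ones(img.shape, dtype=bool)
    border[(slice(1, -1),) * img.ndim] = False
    candidates -= set(np.unique(comps[border]))
    output = img.copy()
    for c in candidates:
        mask = comps == c
        # Neighbours of the component: its dilation minus the component itself.
        neigh = ndimage.binary_dilation(mask, structure=footprint) & ~mask
        labels = np.unique(img[neigh])
        if len(labels) == 1:  # surrounded by a single label -> a real hole
            output[mask] = labels[0]
    return output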
Related
I want to write code that outputs the similarities between the values of arrays a, b, and c, i.e. it should check whether any rows are shared between the arrays. I will be comparing b and c to a. So [ 0, 1624580882] exists when comparing a and c, and so on. Both columns must be equal for the comparison to count.
import numpy as np
a= np.array([[ 0, 1624580882],
[ 1, 1624584458],
[ 0, 1624589467],
[ 1, 1624592213],
[ 0, 1624595336],
[ 1, 1624596349]])
b= np.array([[ 1, 1624580882],
[ 1, 1624584460],
[ 1, 1624595336],
[ 1, 1624596349]])
c = np.array([[ 0, 1624580882],
[ 1, 1624584458],
[ 0, 1624589495],
[ 1, 1624592238],
[ 0, 1624595336],
[ 1, 1624596349]])
Expected Output:
b comparison
Similarities= None
c comparison
Similarities= [ 0, 1624580882],[ 1, 1624584464], [ 0, 1624595350],[ 1, 1624596380]
I'm not giving you the actual solution; rather, I can help you with a simple function. You can design the rest of your code around it.
def compare_arrays(arr_1, arr_2):
    result = []
    for row in arr_1:
        result.append(row in arr_2)
    return result
Edit: for getting the indices of the duplicate values:
from numpy.lib import recfunctions as rfn
ndtype = [('a', int)]
a = np.ma.array([1, 1, 1, 2, 2, 3, 3],mask=[0, 0, 1, 0, 0, 0, 1]).view(ndtype)
rfn.find_duplicates(a, ignoremask=True, return_index=True)
Not the most beautiful solution, but the first thing that comes to mind:
result = []
for row in a:
    for irow in c:
        if np.all(np.equal(row, irow)):
            result.append(row)
            break
I note that the solution proposed by Fatin Ishrak Rafi does not work. For example:
>>> [0, 1624589467] in c
>>> True
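A hedged alternative for whole-row matching is to broadcast the comparison over both row axes (a sketch; memory use is proportional to len(a) * len(c)):

matches = a[(a[:, None] == c).all(-1).any(1)]
print(matches)  # rows of a that also appear in c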
I have a fairly big list of numbers that includes negatives and numbers with two decimal places, for example (10348.94, -984.23, 9429.92). I want to find combinations of numbers from the list that add up to a given sum. Numbers in the list can be repeated, and the given sum can be negative.
Here is what I got so far. Repetition and decimals seem to work, but when I try negative numbers, both in the list and in the given sum, it doesn't work.
def Find(goal, VarienceNum):
    variance = [[Listed] for Listed in VarienceNum]
    newList = []
    result = []
    while variance:
        for holder in variance:
            s = sum(holder)
            for Listed in VarienceNum:
                if Listed >= holder[-1]:
                    if s + Listed < goal:
                        newList.append(holder + [Listed])
                    elif s + Listed == goal:
                        result.append(holder + [Listed])
        variance = newList
        newList = []
    return result

goal = float(input("please enter your goal: "))
VarienceNum = list(map(float, input("please enter the list: ").split()))
print(Find(goal, VarienceNum))
Get all subsets of the list, check the sum of each subset, and when that sum finally matches the target value return that subset!
def inc_bool_array(arr, ind=0):
    if ind >= len(arr):
        return
    if arr[ind] == 0:
        arr[ind] = 1
    else:
        arr[ind] = 0
        inc_bool_array(arr, ind + 1)

def find_subset_sum(target, arr):
    size = len(arr)
    pick = [0 for n in arr]
    num_subsets = 2 ** size
    # Loop through every possible subset until we find one such that
    # sum(subset) == target
    for n in range(num_subsets):
        # Subset is determined by the current boolean values in `pick`
        subset = [arr[ind] for ind in range(size) if pick[ind] == 1]
        if sum(subset) == target:
            return subset
        # Update `pick` to the next set of booleans
        inc_bool_array(pick)
    return None

print(find_subset_sum(3, [1, 2, 3]))
print(find_subset_sum(5, [1, 2, 3]))
print(find_subset_sum(6, [1, 2, 3]))
print(find_subset_sum(7, [1, 2, 3]))
print(find_subset_sum(3, [-1, 5, 8]))
print(find_subset_sum(4, [-1, 5, 8]))
print(find_subset_sum(5, [-1, 5, 8]))
print(find_subset_sum(6, [-1, 5, 8]))
print(find_subset_sum(7, [-1, 5, 8]))
print(find_subset_sum(8, [-1, 5, 8]))
print(find_subset_sum(12, [-1, 5, 8]))
print(find_subset_sum(13, [-1, 5, 8]))
The hard part here is getting all possible subsets of the list. Getting all subsets is a matter of choosing "include" or "exclude" for every item in the list (2 options per element results in 2^n possible choices, and 2^n possible subsets).
In order to enumerate all these choices I use a simple array called pick which is composed of boolean values; one boolean value for each value in the source array. Each boolean represents an include/exclude choice for its corresponding value in the source array. The array starts full of only 0, representing the choice of "exclude" for each item. Then a function called inc_bool_array is used to update pick to the next set of values. This means pick will take on these values over time:
Step 1: [ 0, 0, 0, 0, 0, ... ]
Step 2: [ 1, 0, 0, 0, 0, ... ]
Step 3: [ 0, 1, 0, 0, 0, ... ]
Step 4: [ 1, 1, 0, 0, 0, ... ]
Step 5: [ 0, 0, 1, 0, 0, ... ]
Step 6: [ 1, 0, 1, 0, 0, ... ]
Step 7: [ 0, 1, 1, 0, 0, ... ]
Step 8: [ 1, 1, 1, 0, 0, ... ]
Step 9: [ 0, 0, 0, 1, 0, ... ]
.
.
.
Gradually every possible combination of 0s and 1s will occur. Then pick is used to generate a subset which only contains values corresponding to a 1, using a list comprehension with an if condition:
subset = [ arr[ind] for ind in range(len(arr)) if pick[ind] == 1 ]
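For instance, running the increment a few times on a three-element pick reproduces the first steps shown above:

pick = [0, 0, 0]
for _ in range(4):
    print(pick)
    inc_bool_array(pick)
# prints [0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]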
I have a numpy array where 0 denotes empty space and 1 denotes that a location is filled. I am trying to find a quick method of scanning the array for places where multiple zeros are adjacent to each other and returning the location of the central zero.
For Example if I had the following array
[0 1 0 1]
[0 0 0 1]
[0 1 0 1]
[1 1 1 1]
I want to return the locations for which there is an adjacent zero on either side of a central zero
e.g
[1,1]
as this is the centre of 3 zeros, i.e. there is a zero on either side of the zero at this location
I'm aware that this can be calculated using if statements, but I wondered if there was a more Pythonic way of doing this.
Any help is greatly appreciated
The desired output for arbitrary inputs is not exhaustively specified in the question, but here is a possible approach for this kind of problem, adapted to the details of the desired output. It uses np.cumsum, np.bincount, np.where, and np.median to find the middle index of groups of consecutive zeros along the rows of a 2D array:
import numpy as np

def find_groups(x, min_size=3, value=0):
    # Compute a sequential label for groups in each row.
    xc = (x != value).cumsum(1)
    # Count the number of occurrences per group in each row.
    counts = np.apply_along_axis(
        lambda x: np.bincount(x, minlength=1 + xc.max()),
        axis=1, arr=xc)
    # Filter by minimum number of occurrences.
    i, j = np.where(counts >= min_size)
    # Compute the median index of each group.
    return [
        (ii, int(np.ceil(np.median(np.where(xc[ii] == jj)[0]))))
        for ii, jj in zip(i, j)
    ]

x = np.array([[0, 1, 0, 1],
              [0, 0, 0, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 1]])
print(find_groups(x))
# [(1, 1)]
It should work properly even for multiple rows with groups of varying sizes, and even multiple groups per row:
x2 = np.array([[0, 1, 0, 1, 1, 1, 1],
[0, 0, 0, 1, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0]])
print(find_groups(x2))
# [(1, 1), (1, 5), (2, 3), (3, 3)]
I have a big binary array (500 x 700) in which I want to check for NaNs and fill the central pixel with the mode of the eight surrounding pixels (if more than 4 of the surrounding pixels are 0 or 1). It's essentially a 3x3 sliding-window search. Are there any tools/functions to do this in xarray, scipy.ndimage, or even numpy?
E.g.
arr = np.asarray([0, 1, 1, 1, 0, 1, 1, np.nan, 0, 1, 0, 1, 1, 1, 0, 1, 1, np.nan]).reshape(3,6)
The expected fills are arr[1,1] = 1 and arr[-1,-1] = 1 (the latter has only 3 neighbours).
Any help would be highly appreciated..
Thanks in advance.
You can implement your idea directly using numpy and scipy.stats.mode.
First, find the locations of NaN values by comparing the array to itself, because a NaN float is not equal to itself by definition. The np.where function will return all locations where this condition holds, as a pair of index arrays, one for the rows and the other for the columns.
Then, for each location where a NaN is found, add 8 deltas to it to get its surrounding pixels. This can be done efficiently using a delta array, which lists all possible offsets for the row and column index, for each neighbour.
Finally, do a within-boundary check and run the mode function on the selected, valid neighbours and fill this value into the NaN cell.
Here's the code following my description above:
import numpy as np
import scipy.stats

arr = np.asarray([
    0, 1, 1, 1, 0, 1,
    1, np.nan, 0, 1, 0, 1,
    1, 1, 0, 1, 1, np.nan
]).reshape(3, 6)

delta_rows = np.array([-1, -1, -1, 0, 0, 1, 1, 1])
delta_cols = np.array([-1, 0, 1, -1, 1, -1, 0, 1])

nan_rows, nan_cols = np.where(arr != arr)
for nan_row, nan_col in zip(nan_rows, nan_cols):
    neighbour_rows = nan_row + delta_rows
    neighbour_cols = nan_col + delta_cols
    within_boundary = (
        (0 <= neighbour_rows) & (neighbour_rows < arr.shape[0]) &
        (0 <= neighbour_cols) & (neighbour_cols < arr.shape[1])
    )
    neighbour_rows = neighbour_rows[within_boundary]
    neighbour_cols = neighbour_cols[within_boundary]
    arr[nan_row, nan_col] = scipy.stats.mode(arr[neighbour_rows, neighbour_cols]).mode
Afterwards, we can see that each NaN value in arr is correctly populated with the mode of its surrounding cells:
>>> print(arr)
[[0. 1. 1. 1. 0. 1.]
[1. 1. 0. 1. 0. 1.]
[1. 1. 0. 1. 1. 1.]]
Let's say I have a numpy array

    a b c
A = i j k
    u v w
I want to compare the value of the central element with some of its eight neighbor elements (along the axes or along the diagonals). Is there any faster way than a nested for loop (it's too slow for big matrices)?
To be more specific, what I want to do is compare each element's value with its neighbors' and assign new values.
For example:

if (j == 1):
    if (j > i) & (j > k):
        j = 999
    else:
        j = 0
if (j == 2):
    if (j > c) & (j > u):
        j = 999
    else:
        j = 0
...
something like this.
Your operation contains lots of conditionals, so the most efficient way to do it in the general case (any kind of conditionals, any kind of operations) is using loops. This could be done efficiently using numba or cython. In special cases, you can implement it using higher level functions in numpy/scipy. I'll show a solution for the specific example you gave, and hopefully you can generalize from there.
Start with some fake data:
import numpy as np
import scipy.ndimage

A = np.asarray([
    [1, 1, 1, 2, 0],
    [1, 0, 2, 2, 2],
    [0, 2, 0, 1, 0],
    [1, 2, 2, 1, 0],
    [2, 1, 1, 1, 2]
])
We'll find locations in A where various conditions apply.
1a) The value is 1
1b) The value is greater than its horizontal neighbors
2a) The value is 2
2b) The value is greater than its diagonal neighbors
Find locations in A where the specified values occur:
cond1a = A == 1
cond2a = A == 2
This gives matrices of boolean values, of the same size as A. The value is true where the condition holds, otherwise false.
Find locations in A where each element has the specified relationships to its neighbors:
# condition 1b: value greater than horizontal neighbors
f1 = np.asarray([[1, 0, 1]])
cond1b = A > scipy.ndimage.maximum_filter(
    A, footprint=f1, mode='constant', cval=-np.inf)

# condition 2b: value greater than diagonal neighbors
f2 = np.asarray([
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 0]
])
cond2b = A > scipy.ndimage.maximum_filter(
    A, footprint=f2, mode='constant', cval=-np.inf)
As before, this gives matrices of boolean values indicating where the conditions are true. This code uses scipy.ndimage.maximum_filter(). This function iteratively shifts a 'footprint' to be centered over each element of A. The returned value for that position is the maximum of all elements for which the footprint is 1. The mode argument specifies how to treat implicit values outside boundaries of the matrix, where the footprint falls off the edge. Here, we treat them as negative infinity, which is the same as ignoring them (since we're using the max operation).
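As a tiny illustration of that mechanic (values made up), with the horizontal footprint each output element is the larger of its left and right neighbours, with -inf standing in beyond the edges:

row = np.asarray([[1.0, 3.0, 2.0]])
print(scipy.ndimage.maximum_filter(row, footprint=np.asarray([[1, 0, 1]]),
                                   mode='constant', cval=-np.inf))
# [[3. 2. 3.]]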
Set values of the result according to the conditions. The value is 999 if conditions 1a and 1b are both true, or if conditions 2a and 2b are both true. Else, the value is 0.
result = np.zeros(A.shape)
result[(cond1a & cond1b) | (cond2a & cond2b)] = 999
The result is:
[
[ 0, 0, 0, 0, 0],
[999, 0, 0, 999, 999],
[ 0, 0, 0, 999, 0],
[ 0, 0, 999, 0, 0],
[ 0, 0, 0, 0, 999]
]
You can generalize this approach to other patterns of neighbors by changing the filter footprint. You can generalize to other operations (minimum, median, percentiles, etc.) using other kinds of filters (see scipy.ndimage). For operations that can be expressed as weighted sums, use 2d cross correlation.
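For example, the sum of the eight neighbours is such a weighted sum and could be computed with scipy.ndimage.correlate (a sketch using a 3x3 kernel whose centre weight is zero):

kernel = np.asarray([[1, 1, 1],
                     [1, 0, 1],
                     [1, 1, 1]])
neighbour_sums = scipy.ndimage.correlate(A, kernel, mode='constant', cval=0)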
This approach should be much faster than looping in python. But, it does perform unnecessary computations (for example, it's only necessary to compute the max when the value is 1 or 2, but we're doing it for all elements). Looping manually would let you avoid these computations. Looping in python would probably be much slower than the code here. But, implementing it in numba or cython would probably be faster because these tools generate compiled code.
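For completeness, here is a hedged numba sketch of that manual loop (assuming numba is installed; classify is an illustrative name, and it reproduces only the two example conditions, skipping the neighbour checks for all other values):

import numba

@numba.njit
def classify(A):
    rows, cols = A.shape
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            v = A[r, c]
            if v == 1:
                # condition 1b: greater than both horizontal neighbours
                left = A[r, c - 1] if c > 0 else -np.inf
                right = A[r, c + 1] if c < cols - 1 else -np.inf
                out[r, c] = 999.0 if v > left and v > right else 0.0
            elif v == 2:
                # condition 2b: greater than both anti-diagonal neighbours
                ne = A[r - 1, c + 1] if r > 0 and c < cols - 1 else -np.inf
                sw = A[r + 1, c - 1] if r < rows - 1 and c > 0 else -np.inf
                out[r, c] = 999.0 if v > ne and v > sw else 0.0
    return out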
I used numpy's:
concatenate to pad with zeroes
dstack and roll to align correctly
Then custom_roll is applied twice along different dimensions and the original is subtracted:
import numpy as np

def custom_roll(a, axis=0):
    n = 3
    a = a.T if axis == 1 else a
    pad = np.zeros((n - 1, a.shape[1]))
    a = np.concatenate([a, pad], axis=0)
    ad = np.dstack([np.roll(a, i, axis=0) for i in range(n)])
    a = ad.sum(2)[1:-1, :]
    a = a.T if axis == 1 else a
    return a
Consider the following ndarray:
A = np.arange(25).reshape(5, 5)
A
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
sum_of_eight_around_me = custom_roll(custom_roll(A), axis=1) - A
sum_of_eight_around_me
array([[ 12., 20., 25., 30., 20.],
[ 28., 48., 56., 64., 42.],
[ 53., 88., 96., 104., 67.],
[ 78., 128., 136., 144., 92.],
[ 52., 90., 95., 100., 60.]])