Improving background subtraction when encountering split objects in foreground masks

Improving background subtraction when encountering split objects in foreground masks - python

For a project I implemented a simple background subtraction using a median background estimation. The result is not bad, but often moving objects (people in my test examples) are cut in unconnected blobs.
I tried calling open and close operations (I removed the close operation, because it seemed as if it wouldn't improve the result) on the foreground mask to improve the result, which worked to some degree. However, I am wondering if there are even more ways how I could improve the foreground mask. It is still a fair bit away from the ground truth.
I am aware that playing around with the threshold itself is also always a viable solution and I do play around with that too. That being said, I focus on reducing noise to a minimum. I also tried adaptive thresholding, but that didn't look very promising for this usecase.
Without opening:
With opening:
I am more looking for general approaches than to actual implementations.
Pseudocode of background subtraction
Greyscale all images.
Make a background estimation by calculating the median for every r,g and b value for every pixel in a subset of all images.
Then take every image and calculate the absolute difference between that image and the background estimation.
Apply a threshold to get a binary result called the foreground mask
Use opencvs open operation once.

I like the greyscale simplification.
Simple is good.
We should make everything as simple as
possible, but not simpler.
Let's attack your model for a moment.
An evil clothing designer with an army
of fashion models sends them walking
past your camera, each wearing a red
shirt that is slightly darker than
the preceding one.
At least one of the models will be
"invisible" against some of your background pixels,
having worn a matching shade,
with matching illumination,
compared with the median pixel value.
Repeat with a group of green shirts, then blue.
How to remedy this?
In each channel compute the median red,
median green, median blue pixel intensity.
At inference time, compute three absolute
value differences.
Threshold on max of those deltas.
Computing over sensor R, G, B is straightforward.
Human perception more closely aligns with H, S, V.
Consider computing max delta over those three,
or over all six.
For each background pixel,
compute both expected value and variance,
either for the whole video or for one-minute
slots of time.
Now, at inference time, the variance
informs your thresholding decision,
improving its accuracy.
For example, some pixels may have constant
illumination, others slowly change with the
movement of the sun, and others are quite
noisy when wind disturbs the leaves of
vegetation. The variance lets you capture
this quite naturally.
For a much more powerful modeling approach,
go with Optical Flow.
https://docs.opencv.org/3.4/d4/dee/tutorial_optical_flow.html

Related

Identifying positive pixels after color deconvolution ignoring boundaries

I am analyzing histology tissue images stained with a specific protein marker which I would like to identify the positive pixels for that marker. My problem is that thresholding on the image gives too much false positives which I'd like to exclude.
I am using color deconvolution (separate_stains from skimage.color) to get the AEC channel (corresponding to the red marker), separating it from the background (Hematoxylin blue color) and applying cv2 Otsu thresholding to identify the positive pixels using cv2.threshold(blur,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU), but it is also picking up the tissue boundaries (see white lines in the example picture, sometimes it even has random colors other than white) and sometimes even non positive cells (blue regions in the example picture). It's also missing some faint positive pixels which I'd like to capture.
Overall: (1) how do I filter the false positive tissue boundaries and blue pixels? and (2) how do I adjust the Otsu thresholding to capture the faint red positives?
Adding a revised example image -
top left the original image after using HistoQC to identify tissue regions and apply the mask it identified on the tissue such that all of the non-tissue regions are black. I should tru to adjust its parameters to exclude the folded tissue regions which appear more dark (towards the bottom left of this image). Suggestions for other tools to identify tissue regions are welcome.
top right hematoxylin after the deconvolution
bottom left AEC after the deconvolution
bottom right Otsu thresholding applied not the original RGB image trying to capture only the AEC positives pixels but showing also false positives and false negatives
Thanks

#cris-luengo thank you for your input on scikit-image! I am one of the core developers, and based on #assafb input, we are trying to rewrite the code on color/colorconv/separate_stains.
#Assafb: The negative log10 transformation is the Beer-Lambert mapping. What I don't understand in that code is the line rgb += 2. I don't know where that comes from or why they use it. I'm 100% sure it is wrong. I guess they're trying to avoid log10(0), but that should be done differently. I bet this is where your negative values come from, though.
Yes, apparently (I am not the original author of this code) we use rgb += 2 to avoid log10(0). I checked Fiji's Colour Deconvolution plugin, and they add 1 to their input. I tested several input numbers to help on that, and ~2 would let us closer to the desirable results.
#Assafb: Compare the implementation in skimage with what is described in the original paper. You'll see several errors in the implementation, most importantly the lack of a division by the max intensity. They should have used -np.log10(rgb/255) (assuming that 255 is the illumination intensity), rater than -np.log10(rgb).
Our input data is float; the max intensity in this case would be 1. I'd say that that's the reason we don't divide by something.
Besides that, I opened an issue on scikit-image to discuss these problems — and to specify a solution. I made some research already — I even checked DIPlib's documentation —, and implemented a different version of that specific function. However, stains are not my main area of expertise, and we would be glad if you could help evaluating that code — and maybe pointing a better solution.
Thank you again for your help!

There are several issues that cause improper quantification. I'll go over the details of how I would recommend you tackle these slides.
I'm using DIPlib, because I'm most familiar with it (I'm an author). It has Python bindings, which I use here, and can be installed with pip install diplib. However, none of this is complicated image processing, and you should be able to do similar processing with other libraries.
Loading image
There is nothing special here, except that the image has strong JPEG compression artifacts, which can interfere with the stain unmixing. We help the process a bit by smoothing the image with a small Gaussian filter.
import diplib as dip
import numpy as np
image = dip.ImageRead('example.png')
image = dip.Gauss(image, [1]) # because of the severe JPEG compression artifacts
Stain unmixing
[Personal note: I find it unfortunate that Ruifrok and Johnston, the authors of the paper presenting the stain unmixing method, called it "deconvolution", since that term already had an established meaning in image processing, especially in combination with microscopy. I always refer to this as "stain unmixing", never "deconvolution".]
This should always be the first step in any attempt at quantifying from a bightfield image. There are three important RGB triplets that you need to determine here: the RGB value of the background (which is the brightness of the light source), and the RGB value of each of the stains. The unmixing process has two components:
First we apply the Beer-Lambert mapping. This mapping is non-linear. It converts the transmitted light (as recorded by the microscope) into absorbance values. Absorbance indicates how strongly each point on the slide absorbs light of the various wavelengths. The stains absorb light, and differ by the relative absorbance in each of the R, G and B channels of the camera.
background_intensity = [209, 208, 215]
image = dip.BeerLambertMapping(image, background_intensity)
I manually determined the background intensity, but you can automate that process quite well if you have whole slide images: in whole slide images, the edges of the image always correspond to background, so you can look there for intensities.
The second step is the actual unmixing. The mixing of absorbances is a linear process, so the unmixing is solving of a set of linear equations at each pixel. For this we need to know the absorbance values for each of the stains in each of the channels. Using standard values (as in skimage.color.hax_from_rgb) might give a good first approximation, but rarely will provide the best quantification.
Stain colors change from assay to assay (for example, hematoxylin has a different color depending on who made it, what tissue is stained, etc.), and change also depending on the camera used to image the slide (each model has different RGB filters). The best way to determine these colors is to prepare a slide for each stain, using all the same protocol but not putting on the other dyes. From these slides you can easily obtain stain colors that are valid for your assay and your slide scanner. This is however rarely if ever done in practice.
A more practical solution involves estimating colors from the slide itself. By finding a spot on the slide where you see each of the stains individually (where stains are not mixed) one can manually determine fairly good values. It is possible to automatically determine appropriate values, but is much more complex and it'll be hard finding an existing implementation. There are a few papers out there that show how to do this with non-negative matrix factorization with a sparsity constraint, which IMO is the best approach we have.
hematoxylin_color = np.array([0.2712, 0.2448, 0.1674])
hematoxylin_color = (hematoxylin_color/np.linalg.norm(hematoxylin_color)).tolist()
aec_color = np.array([0.2129, 0.2806, 0.4348])
aec_color = (aec_color/np.linalg.norm(aec_color)).tolist()
stains = dip.UnmixStains(image, [hematoxylin_color, aec_color])
stains = dip.ClipLow(stains, 0) # set negative values to 0
hematoxylin = stains.TensorElement(0)
aec = stains.TensorElement(1)
Note how the linear unmixing can lead to negative values. This is a result of incorrect color vectors, noise, JPEG artifacts, and things on the slide that absorb light that are not the two stains we defined.
Identifying tissue area
You already have a good method for this, which is applied to the original RGB image. However, don't apply the mask to the original image before doing the unmixing above, keep the mask as a separate image. I wrote the next bit of code that finds tissue area based on the hematoxylin stain. It's not very good, and it's not hard to improve it, but I didn't want to waste too much time here.
tissue = dip.MedianFilter(hematoxylin, dip.Kernel(5))
tissue = dip.Dilation(tissue, [20])
tissue = dip.Closing(tissue, [50])
area = tissue > 0.2
Identifying tissue folds
You were asking about this step too. Tissue folds typically appear as larger darker regions in the image. It is not trivial to find an automatic method to identify them, because a lot of other things can create darker regions in the image too. Manual annotation is a good start, if you collect enough manually annotated examples you could train a Deep Learning model to help you out. I did this just as a place holder, again it's not very good, and identifies some positive regions as folds. Folds are subtracted from the tissue area mask.
folds = dip.Gauss(hematoxylin - aec, [20])
area -= folds > 0.2
Identifying positive pixels
It is important to use a fixed threshold for this. Only a pathologist can tell you what the threshold should be, they are the gold-standard for what constitutes positive and negative.
Note that the slides must all have been prepared following the same protocol. In clinical settings this is relatively easy because the assays used are standardized and validated, and produce a known, limited variation in staining. In an experimental setting, where assays are less strictly controlled, you might see more variation in staining quality. You will even see variation in staining color, unfortunately. You can use automated thresholding methods to at least get some data out, but there will be biases that you cannot control. I don't think there is a way out: inconsistent stain in, inconsistent data out.
Using an image-content-based method such as Otsu causes the threshold to vary from sample to sample. For example, in samples with few positive pixels the threshold will be lower than other samples, yielding a relative overestimation of the percent positive.
positive = aec > 0.1 # pick a threshold according to pathologist's idea what is positive and what is not
pp = 100 * dip.Count(dip.And(positive, area)) / dip.Count(area)
print("Percent positive:", pp)
I get a 1.35% in this sample. Note that the % positive pixels is not necessarily related to the % positive cells, and should not be used as a substitute.

I ended up incorporating some of the feedback given above by Chris into the following possible unconventional solution for which I would appreciate getting feedback (to the specific questions below but also general suggestions for improvement or more effective/accurate tools or strategy):
Define (but not apply yet) tissue mask (HistoQC) after optimizing HistoQC script to remove as much of the tissue folds as possible without removing normal tissue area
Apply deconvolution on the original RGB image using hax_from_rgb
Using the second channel which should correspond to the red stain pixels, and subtract from it the third channel which as far as I see corresponds to the background non-red/blue pixels of the image. This step removes the high values in the second channel that which up because of tissue folds or other artifacts that weren't removed in the first step (what does the third channel correspond to? The Green element of RGB?)
Blur the adjusted image and threshold based on the median of the image plus 20 (Semi-arbitrary but it works. Are there better alternatives? Otsu doesn't work here at all)
Apply the tissue regions mask on the thresholded image yielding only positive red/red-ish pixels without the non-tissue areas
Count the % of positive pixels relative to the tissue mask area
I have been trying to apply, as suggested above, the tissue mask on the deconvolution red channel output and then use Otsu thresholding. But it failed since the black background generated by the applying the tissue regions mask makes the Otsu threshold detect the entire tissue as positive. So I have proceeded instead to apply the threshold on the adjusted red channel and then apply the tissue mask before counting positive pixels. I am interested in learning what am I doing wrong here.
Other than that, the LoG transformation didn't seem to work well because it produced a lot of stretched bright segments rather than just circular blobs where cells are located. I'm not sure why this is happening.

Use ML for this case.
Create manually binary mask for your pictures: each red pixel - white, background pixels - black.
Work in HSV or Lab color space.
Train simple classifier: decision tree or SVM (linear or with RBF)..
Let's test!
See on a good and very simple example with skin color segmentation.
And in the future you can add new examples and new cases without code refactoring: just update dataset and retrain model.

OpenCV find subjective contours like the human eye does

When humans see markers suggesting the form of a shape, they immediately perceive the shape itself, as in https://en.wikipedia.org/wiki/Illusory_contours. I'm trying to accomplish something similar in OpenCV in order to detect the shape of a hand in a depth image with very heavy noise. In this question, assume that skin color based detection is not working (actually it is the best I've achieved so far but it is not robust under changing light conditions, shadows or skin colors. Also various paper shapes (flat and colorful) are on the table, confusing color-based approaches. This is why I'm attempting to use the depth cam instead).
Here's a sample image of the live footage that is already pre-processed for better contrast and with background gradient removed:
I want to isolate the exact shape of the hand from the rest of the picture. For a human eye this is a trivial thing to do. So here are a few attempts I did:
Here's the result with canny edge detection applied. The problem here is that the black shape inside the hand is larger than the actual hand, causing the detected hand to overshoot in size. Also, the lines are not connected and I fail at detecting contours.
Update: Combining Canny and a morphological closing (4x4 px ellipse) makes contour detection possible with the following result. It is still waaay too noisy.
Update 2: The result can be slightly enhanced by drawing that contour to an empty mask, save that in a buffer and re-detect yet another contour on a merge of three buffered images. The line that combines the buffered images is is hand_img = np.array(np.minimum(255, np.multiply.reduce(self.buf)), np.uint8) which is then morphed once again (closing) and finally contour detected. The results are slightly less horrible than in the picture above but laggy instead.
Alternatively I tried to use an existing CNN (https://github.com/victordibia/handtracking) for detecting the approximate position of the hand's center (this step works) and then flood from there. In order to detect contours the result is put into an OTSU filter and then the largest contour is taken, resulting in the following picture (ignore black rectangles in the left). The problem is that some of the noise is flooded as well and the results are mediocre:
Finally, I tried background removers such as MOG2 or GMG. They are confused by the enormous amount of fast-moving noise. Also they cut off the fingertips (which are crucial for this project). Finally, they don't see enough details in the hand (8 bit plus further color reduction via equalizeHist yield a very poor grayscale resolution) to reliably detect small movements.
It's ridiculous how simple it is for a human to see the exact precise shape of the hand in the first picture and how incredibly hard it is for the computer to draw a shape.
What would be your recommended method to achieve an exact hand segmentation?

After two days of desperate testing, the solution was to VERY carefully apply thresholding to an well-preprocessed image.
Here are the steps:
Remove as much noise as you possibly can. In my case, denoising was done using Intel's pyrealsense2 (I'm using an Intel RealSense depth camera and the algorithms were written for that camera family, thus they work very well). I used rs.temporal_filter() and directly after rs.hole_filling_filter() on every frame.
Capture the very first frame. Besides capturing the exact distance to the table (for later thresholding), this step also saves a still picture that is blurred by a 100x100 px kernel. Since the camera is never mounted perfectly but slightly tilted, there's an ugly grayscale gradient going over the picture and making operations impossible. This still picture is then subtracted from every single later frame, eliminating the gradient. BTW: this gradient removal step is already incorporated in the screenshots shown in the question above
Now the picture is almost noise-free. Do not use equalizeHist. This does not simply increase the general contrast regularly but instead empathizes the remaining noise way too much. This was my main error I did in almost all experiments. Instead, apply a threshold (binary with fixed border) directly. The border is extremely thin, setting it at 104 instead of 205 makes a huge difference.
Invert colors (unless you have taken BINARY_INV in the previous step), apply contours, take the largest one and write it to a mask
Voilà!

Erosion without losing regions

I have an image containing cells. I can't provide it, but it is similar to the image used as an example here: http://blogs.mathworks.com/steve/2006/06/02/cell-segmentation/ but without the characteristic nuclei.
I have done some processing and am now left with a pretty good segmentation, but some cells are close to each other and I need to split them. Most of them consist of more or less overlapping ellipses.
I am certain that a few iterations of simple erosion will split almost all of those regions. But some of the other cells are so small, they will disappear before the others split. Therefore I need an algorithm that erodes the image, allowing region splitting, but does not delete the last pixel of a region.
I want to use watershed afterwards to segment the cells.
I guess I could implement this on my own by searching for cennected regions and then tracking that I don't lose any or something like that, but the implementation seems messy even in my head and I think there must be an easier way. So my question is basically, what's the name of this so I can google an implementation? Or if there is no off-the-shelf solution, what's an elegant way of implementing this without dozens of iterations and for loops etc.
(Language is python)

It's a classical problem, and if the overlap between cells is too important, let's say 40% or more, then there is not a good solution.
However, if the overlap is not important, here is the solution:
You start from the segmentation you have, let's call it S
You computer the ultimate eroded UE(S). It will give you the center of each cell. It will give you something like the red points on this image. In this image, they use a distance map, an ultimate eroded will be more stable. If there are still many red points per cell, then a dilation of the UE(S) will fix your problem like this example.
You invert Inv(S) or compute the voronoi diagram Voi(S) in order to have a marker in the background.
Watershed on the gradient image of S, using the UE(S) as inner marker (perfect because you have one point by cell) and Inv(S) or Voi(S) as background/outer marker.
You will get something like this example.

Robust Hand Detection via Computer Vision

I am currently working on a system for robust hand detection.
The first step is to take a photo of the hand (in HSV color space) with the hand placed in a small rectangle to determine the skin color. I then apply a thresholding filter to set all non-skin pixels to black and all skin pixels white.
So far it works quite well, but I wanted to ask if there is a better way to solve this? For example, I found a few papers mentioning concrete color spaces for caucasian people, but none with a comparison for asian/african/caucasian color-tones.
By the way, I'm working with OpenCV via Python bindings.

Have you taken a look at the camshift paper by Gary Bradski? You can download it from here
I used the the skin detection algorithm a year ago for detecting skin regions for hand tracking and it is robust. It depends on how you use it.
The first problem with using color for tracking is that it is not robust to lighting variations or like you mentioned, when people have different skin tones. However this can be solved easily as mentioned in the paper by:
Convert image to HSV color space.
Throw away the V channel and consider the H and S channel and hence
discount for lighting variations.
Threshold pixels with low saturation due to their instability.
Bin the selected skin region into a 2D histogram. (OpenCV"s calcHist
function) This histogram now acts as a model for skin.
Compute the "backprojection" (i.e. use the histogram to compute the "probability"
that each pixel in your image has the color of skin tone) using calcBackProject. Skin
regions will have high values.
You can then either use meanShift to look for the mode of the 2D
"probability" map generated by backproject or to detect blobs of
high "probability".
Throwing away the V channel in HSV and only considering H and S channels is really enough (surprisingly) to detect different skin tones and under different lighting variations. A plus side is that its computation is fast.
These steps and the corresponding code can be found in the original OpenCV book.
As a side note, I've also used Gaussian Mixture Models (GMM) before. If you are only considering color then I would say using histograms or GMM makes not much difference. In fact the histogram would perform better (if your GMM is not constructed to account for lighting variations etc.). GMM is good if your sample vectors are more sophisticated (i.e. you consider other features) but speed-wise histogram is much faster because computing the probability map using histogram is essentially a table lookup whereas GMM requires performing a matrix computation (for vector with dimension > 1 in the formula for multi-dimension gaussian distribution) which can be time consuming for real time applications.
So in conclusion, if you are only trying to detect skin regions using color, then go with the histogram method. You can adapt it to consider local gradient as well (i.e. histogram of gradients but possibly not going to the full extent of Dalal and Trigg's human detection algo.) so that it can differentiate between skin and regions with similar color (e.g. cardboard or wooden furniture) using the local texture information. But that would require more effort.
For sample source code on how to use histogram for skin detection, you can take a look at OpenCV"s page here. But do note that it is mentioned on that webpage that they only use the hue channel and that using both hue and saturation would give better result.
For a more sophisticated approach, you can take a look at the work on "Detecting naked people" by Margaret Fleck and David Forsyth. This was one of the earlier work on detecting skin regions that considers both color and texture. The details can be found here.
A great resource for source code related to computer vision and image processing, which happens to include code for visual tracking can be found here. And not, its not OpenCV.
Hope this helps.

Here is a paper on adaptive gaussian mixture model skin detection that you might find interesting.
Also, I remember reading a paper (unfortunately I can't seem to track it down) that used a very clever technique, but it required that you have the face in the field of view. The basic idea was detect the person's face, and use the skin patch detected from the face to identify the skin color automatically. Then, use a gaussian mixture model to isolate the skin pixels robustly.
Finally, Google Scholar may be a big help in searching for state of the art in skin detection. It's heavily researched in adademia right now as well as used in industry (e.g., Google Images and Facebook upload picture policies).

I have worked on something similar 2 years ago. You can try with Particle Filter (Condensation), using skin color pixels as input for initialization. It is quite robust and fast.
The way I applied it for my project is at this link. You have both a presentation (slides) and the survey.
If you initialize the color of the hand with the real color extracted from the hand you are going to track you shouldn't have any problems with black people.
For particle filter I think you can find some code implementation samples. Good luck.

It will be hard for you to find skin tone based on color only.
First of all, it depends strongly on the automatic white balance algorithm.
For example, in this image, any person can see that the color is skin tone. But for the computer it will be blue.
Second, correct color calibration in digital cameras is a hard thing, and it will be rarely accurate enough for your purposes.
You can see www.DPReview.com, to understand what I mean.
In conclusion, I truly believe that the color by itself can be an input, but it is not enough.

Well my experience with the skin modeling are bad, because:
1) lightning can vary - skin segmentation is not robust
2) it will mark your face also (as other skin-like objects)
I would use machine learning techniques like Haar training, which, in my opinion, if far more better approach than modeling and fixing some constraints (like skin detection + thresholding...)

As more robust then pixel colour you can use hand geometry model. First project model for particular gesture and the cross-correlate it with source image. Here is demo of this tchnique.

Determine height of Coffee in the pot using Python imaging

We have a web-cam in our office kitchenette focused at our coffee maker. The coffee pot is clearly visible. Both the location of the coffee pot and the camera are static. Is it possible to calculate the height of coffee in the pot using image recognition? I've seen image recognition used for quite complex stuff like face-recognition. As compared to those projects, this seems to be a trivial task of measuring the height.
(That's my best guess and I have no idea of the underlying complexities.)
How would I go about this? Would this be considered a very complex job to partake? FYI, I've never done any kind of imaging-related work.

Since the coffee pot position is stationary, get a sample frame and locate a single column of pixels where the minimum and maximum coffee quantities can easily be seen, in a spot where there are no reflections. Check the green vertical line segment in the following picture:
(source: nullnetwork.net)
The easiest way is to have two frames, one with the pot empty, one with the pot full (obviously under the same lighting conditions, which typically would be the case), convert to grayscale (colorsys.rgb_to_hsv each RGB pixel and keep only the v (3rd) component) and sum the luminosity of all pixels in the chosen line segment. Let's say the pot-empty case reaches a sum of 550 and the pot-full case a sum of 220 (coffee is dark). By comparing an input frame sum to these two sums, you can have a rough estimate of the percentage of coffee in the pot.
I wouldn't bet my life on the accuracy of this method, though, and the fluctuations even from second to second might be wild :)
N.B: in my example, the green column of pixels should extend to the bottom of the pot; I just provided an example of what I meant.

Steps that I'd try:
Convert the image in grayscale.
Binarize the image, and leave only the coffee. You can discover a good threshold manually through experimentation.
Blob extraction. Blob's area (number of pixels) is one way to calculate the height, ie area / width.

First do thresholding, then segmentation. Then you can more easily detect edges.

You're looking for edge detection. But you only need to do it between the brown/black of the coffee and the color of the background behind the pot.

make pictures of the pot with different levels of coffe in it.
downsample the image to maybe 4*10 pixels.
make the same in a loop for each new live picture.
calculate the difference of each pixels value compared to the reference images.
take the reference image with the least difference sum and you get the state of your coffe machine.
you might experiment if a grayscale version or only red or green might give better results.
if it gives problems with different light settings this aproach is useless. just buy a spotlight for the coffe machine, or lighten up, or darken each picture till the sum of all pixels reaches a reference value.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.