I am new to the Computer Vision field and looking for your guidance to identify an approach to tackle the following scenario:
What approach to follow to do Quality Control on small and thin metal rings using Computer Vision
Below is the detailed requirement (this is the best I can share):
To begin with, I have attached a picture of the ring we need to do QC of.
Ring_for_QC
Ring diameter = 3 inch
We need to do the following checks:
1. Surface coating of the ring peeled off
2. Portion of ring chipped off
3. Scratch on the ring's surface
4. Width of the ring is uneven
5. Dent on the ring
6. Entire surface of the ring is not completely horizontal to the plane; maybe due to some dent, a part of the ring is resting on the plane surface creating a 1 or 2 degree angle (I have marked no. 6 as 'uneven surface' in the attached picture)
I have also attached another picture marking the quality issues found on a random ring (elevated view with marked QC issues).
Scenario:
A single ring can have one or more of the 6 defects mentioned above
Issues 1 & 3 can occur on either surface of the ring, and we need to check both surfaces
We need to QC one single ring at a time
Challenge:
- Need to set up a workstation to capture an image or video of each ring under check
How many cameras will there be in that workstation, and at what angle should each camera be placed?
As we need to check both sides of the ring, we need to decide whether:
we will place the ring on a transparent surface and take the image
or
we need to flip the ring after the image is taken of one side
The next challenge is what computer vision technique we should employ to identify all these issues.
For the time being we are doing some research around OpenCV's background subtraction methods.
It will be helpful to get some insight from you on what would be a better/feasible approach.
Since this is for a student project I'll emphasize image processing more than other aspects of an application. See the bottom section for considerations for real-world applications.
That aside, a general comment: implementing vision for quality control (QC) is hard to get right. If the product to be inspected is cheap (e.g. a ring, a small plastic thing), and if the result of the vision inspection is a borderline pass/fail, or uncertain, you can reject the part. If the part to be inspected is expensive (e.g. a large assembly for a tractor, individual CPUs, medical devices near the end of the production line), then you have to have very well defined specifications, and the system needs to be made as robust as possible.
In general, you want to optimize imaging for each type of defect. For example, the camera location, lens, and lighting to detect scratches may be quite different than what is needed for dimensional gauging (a.k.a. dimensional measurement).
Machine Vision vs. Computer Vision
When you search online for algorithms, equipment, and techniques specific to vision for industrial automation, including the quality control of parts on production lines, then for English-language websites favor the term "machine vision" instead of "computer vision."
https://en.wikipedia.org/wiki/Machine_vision
Machine vision is the common industry term for image processing (+ cameras + lighting + ...) for industrial use. Although different people may use different terminology, and the terminology isn't as important as learning techniques, you'll find a lot of material by searching for "machine vision." The term "computer vision" tends to be used for non-industrial applications, and for academic research, though in languages other than English the terms "machine vision" and "computer vision" may be the same. By comparison, "medical imaging" is similar to machine vision, but involves application of image processing to medical applications.
Lighting
Most importantly, you must control the lighting. Ambient lighting, such as desk lamps, overhead lights, etc., is not only useless for a vision system inspecting parts in production, but will typically interfere with image processing. You might find some defects sometimes with poorly controlled light, but to generate the most consistent results, you'll need to set up lights in specific locations, run the lights at specific, verifiable intensities, and have your vision system detect when something has gone wrong with the lighting.
There are "machine vision lights" designed especially for specific applications such as finding scratches in shiny surfaces, making shiny surfaces look less shiny, to backlight parts (which is useful for dimensional gauging), to illuminate parts from low angles, and so on. Read about different types of lighting.
https://smartvisionlights.com/
https://www.vision-systems.com/content/dam/VSD/solutionsinvision/Resources/lighting_tips_white_paper.pdf
Rather than spend a lot of money on special lights, you can mock them up:
LED flashlight or single LED (as a "point" light source)
Bright light + translucent sheet of plastic (for backlighting)
White tissue paper or some other diffusing material in front of a bright light
...
The importance of lighting cannot be overstated. Controlling lighting conditions improves the chance of success, and is typically necessary to achieve the accuracy of measurement or pass/fail assessment required in real-world environments.
Accuracy, Correctness, Usefulness
At some point you'll probably wonder whether machine learning is useful or necessary for the application. The question to ask yourself (or the customer) is this: what percentage of defects would need to be detected?
For example, if a chip is missing from the ring that could be a fatal defect. Is the ring used in some safety-critical application? If so, vision inspection for QC would have to be extraordinarily robust.
Even if you're familiar with the terms "accuracy" and "precision," make sure they have very clear meanings as you consider image processing problems:
https://en.wikipedia.org/wiki/Accuracy_and_precision
So, what percentage of chip defects needs to be found? 90%? 95%? 98%?
Using the term "accurate" more loosely to mean "the vision system gets the measurement correct and/or finds the defects we know are there," what is the accuracy of the most accurate machine learning algorithm you've read about? Or at least, what would qualify as reasonably impressive accuracy for machine learning? 95%? 98%?
If you're making measurements of machine parts on a production line, then you would typically want the accuracy of dimensional measurements and defect detection to be 99% or better. For high-value products, and products such as electronic components that are highly sensitive to defects, accuracy may need to be 99.999% or better. Think of it this way: if a manufacturer is making thousands or tens of thousands of parts, they don't want garbage parts to make it past your vision system several times a day.
Machine learning for image processing has been around a long time. Processing speeds, memory, and training set sizes have improved, and there have been improvements in algorithms as well, but it's important to note that machine learning is suitable only for some applications, and will fail miserably at other applications.
Techniques
To begin with, I have attached a picture of the ring we need to do QC of.
Ring_for_QC
Ring diameter = 3 inch
Get the exact diameter, including tolerances. If the nominal diameter is 3.000 inches, then the tolerance might be expressed in terms of thousandths of an inch. You may not need to know that for a student project, but if you were proposing a solution for a factory owner you wouldn't want to even suggest a price or timeline for delivery without having complete specs for the part, and numerous samples of the part.
From the one image it's not possible to be too specific about what a defect might look like--the same part can have different defects in different factories, or even on different production lines of the same factory--but we can make some guesses.
1. Surface coating of the ring peeled off
From the one image it's not clear what the surface coating is supposed to look like, or what's underneath. You must provide at least one image of a good part, and at least one image for each type of defect.
What is the surface coating? Anodization? Paint? Enamel? Plastic? Cheese? Whatever the case, knowing what material it is, and how that material degrades, will give some clues about what sort of vision setup may help detect problems with the coating. Changes in coating quality can affect apparent texture (e.g. edge content), brightness/darkness (intensity), color, shininess, and so on.
For the moment, let's assume the coating peeling off changes the brightness or texture of the uncoated surface vs. the remaining coated surface. Then your image processing might look something like the following:
Determine whether a ring is in the image
Segment the ring from the background. That is, use an algorithm such as connected components (OpenCV's findContours()), SIFT, or some other technique to identify the presence and location of a rigid object of known size and shape from the background.
Isolate further processing to just those pixels corresponding to the surface of the part.
Use some technique to find clusters of different texture differences, brightness differences, etc. This is where a better description of the coating is required. If lighting and lens parameters are "fixed," you can consider generating a histogram of brightness values in the image (0 = black, 255 = white) and then comparing the histogram of good parts and bad parts--is there some statistical difference? Or you might use connected components (findContours() again) to cluster pixels of different colors, assuming the lack of coating changes the apparent color of the part: maybe the coating is brown and the part is silvery.
It's hard to guess what technique would be relevant here without photos and/or a much more specific description of the coating. Hopefully this makes it clear why specs are important.
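As a rough illustration of steps like those above, here is a minimal OpenCV (4.x) sketch. It assumes the ring sits on a plain background, that missing coating mainly changes pixel brightness, and that a reference histogram built from known-good parts already exists; the file names and the 0.9 threshold are placeholders you would tune on real samples.
import cv2
import numpy as np

img = cv2.imread("ring_sample.png", cv2.IMREAD_GRAYSCALE)  # hypothetical capture

# Segment the ring from the background (Otsu threshold + largest contour).
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
ring = max(contours, key=cv2.contourArea)

# Restrict further processing to pixels on the part surface.
mask = np.zeros_like(img)
cv2.drawContours(mask, [ring], -1, 255, thickness=cv2.FILLED)

# Compare this part's brightness histogram against a stored reference
# histogram built from good parts (hypothetical file, same 64-bin layout).
hist = cv2.calcHist([img], [0], mask, [64], [0, 256])
cv2.normalize(hist, hist, 0, 1, cv2.NORM_MINMAX)
reference = np.load("good_part_histogram.npy").astype(np.float32)
similarity = cv2.compareHist(hist, reference, cv2.HISTCMP_CORREL)
print("PASS" if similarity > 0.9 else "FAIL: coating may be damaged")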
Coatings can be absent in different ways: peeling, small absences (voids), partially scraped away, etc. It can be difficult to predict in advance what the shape and size of missing coating may be.
When the size and shape of a defect is hard to predict, but when the defect is associated with a difference in image intensity (pixel brightness) or color, then explore these ideas:
Generate an "edge image" in which you find brightness/color transitions. You start with the grayscale or color image, then use Sobel or Canny or some other algorithm to generate an image of edge intensities.
Apply statistical methods to determine how "edgy" an image is. Are there more than N pixels (or more than 5% of all pixels) with an edge strength greater than S?
Once you have some basic algorithm that identifies the difference between good parts and parts with some missing coating, then you could consider using machine learning to review lots (lots!) of samples to help determine the best parameterization. For example, how do you know what number of edge pixels or edge pixel strength should be considered "bad"?
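For the "edge image" plus statistics idea, a minimal sketch might look like the following (Sobel gradient magnitude plus a pixel-count test; the ring mask, the edge-strength threshold S, and the 5% limit are assumptions to tune on good and bad samples).
import cv2
import numpy as np

img = cv2.imread("ring_sample.png", cv2.IMREAD_GRAYSCALE)   # hypothetical capture
mask = cv2.imread("ring_mask.png", cv2.IMREAD_GRAYSCALE)    # hypothetical ring-only mask

# Edge image: gradient magnitude from horizontal and vertical Sobel derivatives.
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)
edge_strength = cv2.magnitude(gx, gy)

# Statistics over ring pixels only: how many are "edgy"?
ring_edges = edge_strength[mask > 0]
S = 80.0                                   # edge-strength threshold (a guess)
fraction = np.count_nonzero(ring_edges > S) / ring_edges.size

print("FAIL: too much edge content" if fraction > 0.05 else "PASS")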
2. Portion of ring chipped off
It depends on whether the chip is visible just from the part's outline. For example, if you placed the part on a light table (a.k.a. "backlight"), would you always see a defect considered to be a "chip"? Or could the chip just be on the top surface facing the camera?
To find chips on edges, having the part on a backlight simplifies matters greatly.
Identify the location and orientation of the part (e.g. using connected components, normalized correlation, SIFT, or whatever algorithm is suitable for the part and accuracy of location required).
Find edges corresponding to the outer and inner rings of the part.
Fit a circle or nearly circular ellipse to the edge points using a Hough circle fit, RANSAC circle fit, or (meh) least-squares circle fit, parameterized to the known dimensions (in pixels) of the outer and inner ring diameters.
For the points used for the circle fits, find the point-to-circle (or point-to-ellipse) shortest distance. The larger this distance, the more likely you have a chip or missing chunk.
To ensure you're finding indentations, chips, or whatever, and not just individual "noise" edge points, examine points in order going clockwise or anticlockwise, and only consider a series of perimeter points as a defect if N successive points have a median (or possibly mean) point-to-circle distance greater than some threshold distance D.
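Here is a minimal sketch of that circle-fit idea, using a plain least-squares (Kasa) fit for brevity where a Hough or RANSAC fit would be more robust; the backlit image, the deviation threshold D, and the run length N are assumptions.
import cv2
import numpy as np

img = cv2.imread("ring_backlit.png", cv2.IMREAD_GRAYSCALE)  # hypothetical backlit capture
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
outer = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(np.float64)
x, y = outer[:, 0], outer[:, 1]

# Least-squares circle fit: x^2 + y^2 = 2ax + 2by + c, with r^2 = c + a^2 + b^2.
A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
sol, *_ = np.linalg.lstsq(A, x ** 2 + y ** 2, rcond=None)
a, b, c = sol
r = np.sqrt(c + a ** 2 + b ** 2)

# Point-to-circle distance for every perimeter point.
dist = np.abs(np.sqrt((x - a) ** 2 + (y - b) ** 2) - r)

# Flag a chip only when several successive perimeter points deviate,
# so isolated noisy edge points don't trigger a failure.
D = 3.0    # deviation threshold in pixels (a guess)
N = 10     # required run length (a guess)
runs = np.convolve((dist > D).astype(int), np.ones(N, dtype=int), mode="valid")
print("FAIL: possible chip" if np.any(runs == N) else "PASS")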
A simpler approach could be to fit a black-and-white mask--a template--representing a good part to the current location and rotation of the part to be inspected. If the template and sample part are aligned very precisely, and if you perform image subtraction, then you may be fortunate enough to get clusters of pixels where there are defects. But this method is fairly crude, and harder to make robust.
There are machine learning techniques to identify chips on edges, but you'd need lots of part samples to train them. Optionally, if you don't have enough samples, you can use the same samples with slightly modified lighting, at different locations in the image, with manually added defects, etc., to help train the algorithm. But that's another discussion altogether.
3. Scratch on the ring's surface
See the link above about different types of lighting. You'll need to experiment with a few different lighting configurations to figure out what works for your part.
Generally, though, scratches are likely to show differences in brightness and "edginess" (image edge content) relative to the rest of the part. If you're lucky, a scratch can reveal a different color.
Scratches can vary so much in appearance, area, and shape that it would be hard to parameterize an algorithm to catch them all. Once again, statistical analysis of edge content, brightness, and color tends to be useful.
In general: to achieve the best results for a particular QC inspection, you'll need to engineer a system specifically for the part. Your vision system may be configurable, and there can be different combinations of lights and cameras for different types of QC inspection, but for any particular defect detection you want to control the appearance of the part as much as possible. Relying on software to do all the work yields a less robust system that customers will typically yank out and throw away.
4. Width of the ring is uneven
This is almost an example of dimensional gauging or optical gauging. If you're just looking for unevenness, you don't necessarily need to measure the diameter in engineering units such as millimeters: you can just measure pixels. BUT the effort required to ensure your measurement in pixels is accurate will typically lead you to measuring in millimeters anyway.
Assuming the optical setup is correct and (more or less) calibrated, which I'll describe below, here's a basic process:
Identify the position and orientation of the part
From the algorithm that finds the part, or from a follow-on algorithm that identifies edge pixels (e.g. Sobel, Canny, ...), find the edge pixels just for the outer diameter of the ring.
Perform a circle/ellipse fit to the edge pixels, and eliminate outlier pixels that don't actually belong to the circle/ellipse.
Have your algorithm start with the 1st pixel in the list of edge pixels corresponding to the outer diameter.
From that 1st pixel, find the edge pixel farthest away. Ideally, this would be the point diametrically opposite.
Cycle through all pixels, finding the distance to the farthest pixel. (This is not optimal in terms of speed, but simpler to code.)
Generate a histogram of all distances.
Make a determination of good/bad based on the histogram of point-to-point distances.
You might call a part "bad" for one or more of the following conditions:
At least N point-to-point distances exceed a distance of P pixels
The standard deviation of point-to-point distances exceeds some threshold T
...
The accuracy of these measurements depends on the consistency of point-to-point distances at different locations within the image. If you perform accurate, precise measurements of distance, you'll notice that an object of fixed length appears to vary in length depending on its location in the image: if the object is located in the center of the image it may appear to be 57.5 pixels long, but in one corner of the image it may appear to be 56.2 pixels long.
To correct for these irregularities, you can...
Perform a nonlinear flatness correction. This will also correct for non-normal alignment of the camera to the part, though you want to start with the optical axis of the camera as normal (perpendicular) to the surface of the part as possible.
Make a few quick measurements to estimate how much measurements vary.
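Putting the unevenness check together, here's a rough sketch in pixels (it assumes the outer edge has already been segmented as above; the thresholds P and T are guesses to tune per setup):
import cv2
import numpy as np

img = cv2.imread("ring_backlit.png", cv2.IMREAD_GRAYSCALE)  # hypothetical backlit capture
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
outer = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(np.float64)

# Subsample the perimeter so the O(n^2) farthest-point search stays cheap.
pts = outer[::5]

# For each outer-edge point, the distance to the farthest edge point
# (ideally the diametrically opposite point).
diffs = pts[:, None, :] - pts[None, :, :]
farthest = np.sqrt((diffs ** 2).sum(axis=2)).max(axis=1)

# Simple pass/fail rules on the distribution of these "diameters".
P = 2.0    # allowed spread in pixels (a guess)
T = 0.8    # allowed standard deviation in pixels (a guess)
spread = farthest.max() - farthest.min()
print("spread:", spread, "std:", farthest.std())
print("FAIL: outer edge uneven" if spread > P or farthest.std() > T else "PASS")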
5. Dent on the ring
6. Entire surface of the ring is not completely horizontal to the plane; maybe due to some dent, a part of the ring is resting on the plane surface creating a 1 or 2 degree angle (I have marked no. 6 as 'uneven surface' in the attached picture)
Use cameras imaging from the sides. Make sure the background is simple.
A 1- to 2-degree difference could be hard to detect using a camera placed directly overhead. If you're lucky you could detect that the outer edge of the part is more elliptical than circular, but the ability to detect this would depend on the color and thickness of the part. Also, you wouldn't necessarily be able to distinguish between a misshapen part and one resting at an angle--but for some inspections that's okay, since both are defects.
HOWEVER, in a real-world application the customer might not be happy if you reject parts that are otherwise good, but happen to be sitting at a slight angle. A mechanical fixture might fix the problem by ensuring parts lie flat.
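For reference, here is what the overhead ellipse check might look like (same segmentation assumptions as above); note that a 2-degree tilt only shortens the apparent minor axis by a factor of roughly cos(2 deg), about 0.9994, which is usually lost in measurement noise--which is why side cameras are the safer bet.
import cv2

img = cv2.imread("ring_top_view.png", cv2.IMREAD_GRAYSCALE)  # hypothetical overhead capture
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
outer = max(contours, key=cv2.contourArea)

# A tilted (or misshapen) ring projects to an ellipse rather than a circle.
(cx, cy), (ax1, ax2), angle = cv2.fitEllipse(outer)
ratio = min(ax1, ax2) / max(ax1, ax2)

# For a 2-degree tilt the expected ratio is only about 0.9994, typically
# within measurement noise, so don't rely on this check alone.
print("axis ratio:", round(ratio, 4))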
I have also attached another picture marking the quality issues found on a random ring (elevated view with marked QC issues).
The image isn't clear enough. Put the part on a simpler background and tinker with lighting to make it more obvious what the differences are between good and bad.
A single ring can have one or more of the 6 defects mentioned above
Run one algorithm after the other. You may also have to turn different lights on and off before running each algorithm (or rather, each chain of algorithms).
Issue 1 & 3 can occur at either surface of the ring and we need to check both the surfaces
We need to QC on one single ring at a time
You may have to write an algorithm to detect whether multiple rings happen to be present. Even if you weren't asked to do this specifically, this happens in production, and your professor may surprise you with it. At least have an idea how you would detect the presence of multiple rings.
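At a minimum, a sanity check for "is there exactly one ring in view?" can just count ring-sized blobs; the area limits below are placeholders that depend entirely on your lens, working distance, and resolution.
import cv2

img = cv2.imread("station_view.png", cv2.IMREAD_GRAYSCALE)  # hypothetical capture
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Keep only blobs whose area is roughly that of one ring (limits are guesses).
MIN_AREA, MAX_AREA = 50_000, 120_000
rings = [c for c in contours if MIN_AREA < cv2.contourArea(c) < MAX_AREA]

if len(rings) == 0:
    print("No ring in view")
elif len(rings) > 1:
    print("Multiple rings in view: reject / raise an alarm")
else:
    print("Exactly one ring: proceed with inspection")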
That's another aspect of vision: you may start thinking of what algorithms and lighting are necessary to solve "the problem," but you'll also spend a lot of time figuring out everything that could go wrong, and writing software to detect those conditions to ensure you don't yield a false result. For example, what happens if the lights turn off? What if two rings are present? What if the ring isn't fully within the field of view? What if dirt gets on the surface the part is resting on? What if the lens gets dirty (which it will)?
A few principles:
Provide the best image for image processing before you consider what algorithm would work best.
Understand what accuracy/success rate is necessary, and measure it.
Get as many samples as you possibly can: hundreds, thousands if possible. Having a chance to measure "online" (in real production) is helpful.
Real-world applications
If it were a real-world application--that is, if you went into the field of vision professionally--there are many more steps that may seem less difficult, but that turn out to be critical:
How rings come into view (or into "station"): on a moving conveyor? placed by a robot? in some container?
What triggers vision inspection of the ring -- a programmable logic controller, a "light curtain" the ring passes through, or whether the vision system itself has to determine when a ring is ready for inspection.
How results are communicated to other equipment. (This can be a huge hassle, and an otherwise good vision system can be rejected by a customer if communications aren't designed and implemented properly.)
Whether you are guaranteed to see only one ring at a time
This isn't to say university isn't the real world: just that you probably won't lose tens or hundreds of thousands of Euros/pounds/dollars if you happen to overlook something.
You can look at how face recognition is done:
1. Face detection.
2. Face alignment and normalization.
3. Feature extraction.
4. Comparing features against a pattern.
But in your case, you can skip step 3 and compare the output of step 2 with the reference image. Depending on the conditions, additional filtering may be necessary.
I am trying to detect specific objects in images using Haar cascade in OpenCV.
Let's say I am interested in detecting stop signs in landscape images. When defining positive image samples for my training set, which would be the best kind of image: (a) full images with my object, (b) a medium crop or (c) a tight crop?
Similarly, what's best for negative images? Does this influence overfitting? I would also appreciate any other general tips from those with experience. Thanks.
Image ref: http://kaitou-ace.deviantart.com/art/Stop-sign-on-a-country-road-Michigan-271990933
You only want features that you want to detect in your positive samples. So image (c) would be correct for positive samples.
As for negative samples, you want EVERYTHING else. Although that is obviously unrealistic, if you are using your detector in a specific environment then training on that environment as negatives is the right way to go, i.e. lots of pictures of landscapes etc. (ones that don't have stop signs in them).
The best choice is (c) because (a) and (b) contain too many features, all around the border of the sign, that are not interesting for you.
Not only they are not useful but they can seriously compromise the performance of the algorithm.
In case (c), the algorithm's aim is to recognize situations where the current window contains the features you are looking for.
But what about (a) and (b)?
In those cases the algorithm has to detect interesting features just in a corner of the window (and unfortunately that corner could be everywhere) and at the same time to be consistent with all the infinite possibilities that could occur around that corner.
You would need a huge amount of samples and anyway, even if you finally manage to get an acceptable hit rate, the job of separating positives and negatives is so difficult that the running time would be very high.
As to negatives collection, ideally you should pick up images that reproduce what you think are the images against which your final detector will run.
For example if you think that indoor images are not interesting for this, just discard them. If you think that a certain kind of landscapes are the ones where you detector will run, just retain much of them.
But this is only theoretical; I feel that the improvement would be negligible. Just collect as many images as you can: it's the number of different images that really matters.
Shamelessly jumping on the bandwagon :-)
Inspired by How do I find Waldo with Mathematica and the followup How to find Waldo with R, as a new python user I'd love to see how this could be done. It seems that python would be better suited to this than R, and we don't have to worry about licenses as we would with Mathematica or Matlab.
In an example like the one below obviously simply using stripes wouldn't work. It would be interesting if a simple rule based approach could be made to work for difficult examples such as this.
I've added the [machine-learning] tag as I believe the correct answer will have to use ML techniques, such as the Restricted Boltzmann Machine (RBM) approach advocated by Gregory Klopper in the original thread. There is some RBM code available in python which might be a good place to start, but obviously training data is needed for that approach.
At the 2009 IEEE International Workshop on MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP 2009) they ran a Data Analysis Competition: Where's Wally?. Training data is provided in matlab format. Note that the links on that website are dead, but the data (along with the source of an approach taken by Sean McLoone and colleagues can be found here (see SCM link). Seems like one place to start.
Here's an implementation with mahotas
from pylab import imshow
import numpy as np
import mahotas
wally = mahotas.imread('DepartmentStore.jpg')
wfloat = wally.astype(float)
r,g,b = wfloat.transpose((2,0,1))
Split into red, green, and blue channels. It's better to use floating point arithmetic below, so we convert at the top.
w = wfloat.mean(2)
w is the white channel.
pattern = np.ones((24,16), float)
for i in range(2):
    pattern[i::4] = -1
Build up a pattern of +1,+1,-1,-1 on the vertical axis. This is wally's shirt.
v = mahotas.convolve(r-w, pattern)
Convolve with red minus white. This will give a strong response where the shirt is.
mask = (v == v.max())
mask = mahotas.dilate(mask, np.ones((48,24)))
Look for the maximum value and dilate it to make it visible. Now, we tone down the whole image, except the region of interest:
wally -= .8*wally * ~mask[:,:,None]
imshow(wally)
And we get the result: the whole image toned down except for the region where Wally is.
You could try template matching, noting which location produced the highest resemblance, and then using machine learning to narrow it down further. That is also very difficult, and with the accuracy of template matching, it may just return every face or face-like image. I think you will need more than just machine learning if you hope to do this consistently.
Maybe you should start by breaking the problem into two smaller ones:
create an algorithm that separates people from the background.
train a neural network classifier with as many positive and negative examples as possible.
Those are still two very big problems to tackle...
BTW, I would choose C++ and OpenCV; they seem much more suited for this.
This is not impossible, but very difficult, because you really have no example of a successful match. There are often multiple states (in this case, more examples of Where's Wally drawings); you can then feed multiple pictures into an image recognition program, treat it as a hidden Markov model, and use something like the Viterbi algorithm for inference ( http://en.wikipedia.org/wiki/Viterbi_algorithm ).
That's the way I would approach it, but it assumes you have multiple images you can give it as examples of the correct answer so it can learn. If you only have one picture, then I'm sorry, there may be another approach you need to take.
I recognized that there are two main features which are almost always visible:
the red-white striped shirt
dark brown hair under the fancy cap
So I would do it the following way:
search for striped shirts:
filter out red and white color (with thresholds on the HSV converted image). That gives you two mask images.
add them together -> that's the main mask for searching striped shirts.
create a new image with all the filtered out red converted to pure red (#FF0000) and all the filtered out white converted to pure white (#FFFFFF).
now correlate this pure red-white image with a stripe pattern image (I think all the Waldos have nearly perfect horizontal stripes, so rotation of the pattern shouldn't be necessary). Do the correlation only inside the above mentioned main mask (see the sketch at the end of this answer).
try to group together clusters which could have resulted from one shirt.
If there is more than one 'shirt', that is, more than one cluster of positive correlation, search for other features, like the dark brown hair:
search for brown hair
filter out the specific brown hair color using the HSV converted image and some thresholds.
search for a certain area in this masked image - not too big and not too small.
now search for a 'hair area' that is just above a (before) detected striped shirt and has a certain distance to the center of the shirt.
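A rough OpenCV sketch of the red/white masking and stripe-correlation part of this approach (the HSV ranges and the stripe template size are guesses that would need tuning per image):
import cv2
import numpy as np

img = cv2.imread("wheres_wally.jpg")                 # hypothetical input (BGR)
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Red wraps around hue 0 on OpenCV's 0-179 hue scale, so combine two ranges.
red1 = cv2.inRange(hsv, (0, 120, 80), (10, 255, 255))
red2 = cv2.inRange(hsv, (170, 120, 80), (179, 255, 255))
red_mask = cv2.bitwise_or(red1, red2)

# White: low saturation, high value.
white_mask = cv2.inRange(hsv, (0, 0, 200), (179, 40, 255))

# Main mask for searching striped shirts.
main_mask = cv2.bitwise_or(red_mask, white_mask)

# Pure red/white image, as described above.
pure = np.zeros_like(img)
pure[red_mask > 0] = (0, 0, 255)        # pure red (BGR)
pure[white_mask > 0] = (255, 255, 255)  # pure white

# Small horizontal red/white stripe template (size is a guess for the image scale).
stripe = np.zeros((8, 16, 3), np.uint8)
stripe[:4] = (0, 0, 255)
stripe[4:] = (255, 255, 255)

# Correlate; a fuller version would only keep responses that fall inside main_mask.
response = cv2.matchTemplate(pure, stripe, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(response)
print("best stripe match at", max_loc, "score", round(max_val, 3))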
Here's a solution using neural networks that works nicely.
The neural network is trained on several solved examples that are marked with bounding boxes indicating where Wally appears in the picture. The goal of the network is to minimize the error between the predicted box and the actual box from training/validation data.
The network above uses Tensorflow Object Detection API to perform training and predictions.
I am currently working on a system for robust hand detection.
The first step is to take a photo of the hand (in HSV color space) with the hand placed in a small rectangle to determine the skin color. I then apply a thresholding filter to set all non-skin pixels to black and all skin pixels white.
So far it works quite well, but I wanted to ask if there is a better way to solve this? For example, I found a few papers mentioning concrete color spaces for caucasian people, but none with a comparison for asian/african/caucasian color-tones.
By the way, I'm working with OpenCV via Python bindings.
Have you taken a look at the camshift paper by Gary Bradski? You can download it from here
I used the skin detection algorithm a year ago for detecting skin regions for hand tracking and it is robust. It depends on how you use it.
The first problem with using color for tracking is that it is not robust to lighting variations or like you mentioned, when people have different skin tones. However this can be solved easily as mentioned in the paper by:
Convert the image to HSV color space.
Throw away the V channel and consider only the H and S channels, thereby discounting lighting variations.
Threshold out pixels with low saturation due to their instability.
Bin the selected skin region into a 2D histogram (OpenCV's calcHist function). This histogram now acts as a model for skin.
Compute the "backprojection" (i.e. use the histogram to compute the "probability" that each pixel in your image has the color of skin tone) using calcBackProject. Skin regions will have high values.
You can then either use meanShift to look for the mode of the 2D "probability" map generated by the backprojection, or detect blobs of high "probability".
Throwing away the V channel in HSV and only considering H and S channels is really enough (surprisingly) to detect different skin tones and under different lighting variations. A plus side is that its computation is fast.
These steps and the corresponding code can be found in the original OpenCV book.
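A condensed sketch of those steps with OpenCV's Python bindings (the calibration-rectangle coordinates, bin counts, and thresholds are placeholders):
import cv2

frame = cv2.imread("hand_frame.png")                 # hypothetical frame
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

# Skin sample from the small calibration rectangle (coordinates are placeholders).
x, y, w, h = 200, 150, 40, 40
sample = hsv[y:y + h, x:x + w]

# Ignore unstable low-saturation / very dark pixels when building the model.
sample_mask = cv2.inRange(sample, (0, 30, 30), (179, 255, 255))

# 2D H-S histogram as the skin model (the V channel is discarded).
skin_hist = cv2.calcHist([sample], [0, 1], sample_mask, [30, 32], [0, 180, 0, 256])
cv2.normalize(skin_hist, skin_hist, 0, 255, cv2.NORM_MINMAX)

# Backprojection: per-pixel "probability" of being skin.
backproj = cv2.calcBackProject([hsv], [0, 1], skin_hist, [0, 180, 0, 256], 1)

# Either threshold the map to get skin blobs, or feed it to meanShift to find the mode.
_, skin_blobs = cv2.threshold(backproj, 50, 255, cv2.THRESH_BINARY)
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
_, track_window = cv2.meanShift(backproj, (x, y, w, h), criteria)
print("tracked window:", track_window)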
As a side note, I've also used Gaussian Mixture Models (GMM) before. If you are only considering color, then I would say using histograms or a GMM doesn't make much difference. In fact the histogram would perform better (if your GMM is not constructed to account for lighting variations etc.). A GMM is good if your sample vectors are more sophisticated (i.e. you consider other features), but speed-wise the histogram is much faster, because computing the probability map with a histogram is essentially a table lookup, whereas a GMM requires a matrix computation (for vectors with dimension > 1 in the formula for the multi-dimensional Gaussian distribution) which can be time consuming for real-time applications.
So in conclusion, if you are only trying to detect skin regions using color, then go with the histogram method. You can adapt it to consider local gradient as well (i.e. a histogram of gradients, though possibly not going to the full extent of Dalal and Triggs' human detection algorithm) so that it can differentiate between skin and regions with similar color (e.g. cardboard or wooden furniture) using local texture information. But that would require more effort.
For sample source code on how to use a histogram for skin detection, you can take a look at OpenCV's page here. But do note that it is mentioned on that webpage that they only use the hue channel, and that using both hue and saturation would give a better result.
For a more sophisticated approach, you can take a look at the work on "Detecting naked people" by Margaret Fleck and David Forsyth. This was one of the earlier work on detecting skin regions that considers both color and texture. The details can be found here.
A great resource for source code related to computer vision and image processing, which happens to include code for visual tracking, can be found here. And no, it's not OpenCV.
Hope this helps.
Here is a paper on adaptive gaussian mixture model skin detection that you might find interesting.
Also, I remember reading a paper (unfortunately I can't seem to track it down) that used a very clever technique, but it required that you have the face in the field of view. The basic idea was detect the person's face, and use the skin patch detected from the face to identify the skin color automatically. Then, use a gaussian mixture model to isolate the skin pixels robustly.
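A rough sketch of that face-seeded idea, using OpenCV's stock Haar face detector to grab a skin patch automatically; for brevity it builds an H-S histogram model rather than the Gaussian mixture model the paper used, and the cascade path assumes a standard opencv-python install.
import cv2

frame = cv2.imread("person.png")                     # hypothetical frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Stock frontal-face Haar cascade shipped with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
faces = cv2.CascadeClassifier(cascade_path).detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

if len(faces) > 0:
    x, y, w, h = faces[0]
    # Sample the central part of the face (avoids hair and background at the edges).
    patch = frame[y + h // 4:y + 3 * h // 4, x + w // 4:x + 3 * w // 4]
    hsv_patch = cv2.cvtColor(patch, cv2.COLOR_BGR2HSV)

    # Skin model from the face patch, backprojected over the whole frame to find
    # other skin regions (e.g. hands), as in the histogram sketch above.
    hist = cv2.calcHist([hsv_patch], [0, 1], None, [30, 32], [0, 180, 0, 256])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    skin_prob = cv2.calcBackProject([hsv], [0, 1], hist, [0, 180, 0, 256], 1)
    print("skin probability map computed from the detected face")
else:
    print("no face found: fall back to a manual calibration patch")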
Finally, Google Scholar may be a big help in searching for the state of the art in skin detection. It's heavily researched in academia right now as well as used in industry (e.g., Google Images and Facebook upload picture policies).
I worked on something similar 2 years ago. You can try a Particle Filter (Condensation), using skin color pixels as input for initialization. It is quite robust and fast.
The way I applied it for my project is at this link. You have both a presentation (slides) and the survey.
If you initialize the color of the hand with the real color extracted from the hand you are going to track, you shouldn't have any problems with darker skin tones.
For particle filter I think you can find some code implementation samples. Good luck.
It will be hard for you to find skin tone based on color only.
First of all, it depends strongly on the automatic white balance algorithm.
For example, in this image, any person can see that the color is skin tone. But for the computer it will be blue.
Second, correct color calibration in digital cameras is a hard thing, and it will be rarely accurate enough for your purposes.
You can look at www.DPReview.com to understand what I mean.
In conclusion, I truly believe that the color by itself can be an input, but it is not enough.
Well, my experience with skin modeling is bad, because:
1) lighting can vary - skin segmentation is not robust
2) it will also mark your face (and other skin-like objects)
I would use machine learning techniques like Haar training, which, in my opinion, is a far better approach than modeling and fixing some constraints (like skin detection + thresholding...)
As something more robust than pixel colour, you can use a hand geometry model. First, project the model for a particular gesture and then cross-correlate it with the source image. Here is a demo of this technique.
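For the cross-correlation step, a minimal sketch with OpenCV's normalized template matching (the gesture template and scene images are placeholders; a real system would repeat this over several scales and rotations of the projected model):
import cv2

scene = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)              # hypothetical source image
template = cv2.imread("hand_template.png", cv2.IMREAD_GRAYSCALE)   # hypothetical gesture model

# Normalized cross-correlation of the gesture template over the whole image.
response = cv2.matchTemplate(scene, template, cv2.TM_CCORR_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(response)
print("best match at", max_loc, "score", round(max_val, 3))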