I have trained the TensorFlow Object Detection API on my own single-class dataset using the rfcn_resnet101 model. First I used the raccoon dataset and trained for 264,600 steps, and the detection result is strange: the model finds the object, but there are several small extra boxes around the correct box.
Then I used another single-class dataset, this one with 80,000 images, and ran into the same phenomenon. I am very confused.
Has anyone run into the same situation? What can I do to solve this problem? Thanks in advance!
I saw the same behavior on the PASCAL VOC dataset. I haven't fixed it, because I only implemented the model as a proof of concept. My guess is that the model predicts proposal regions and accepts them if the IoU is greater than or equal to a defined threshold, so adjusting the nms_iou_threshold might solve the problem.
This assumption also seems to fit your examples: all of the predicted bounding boxes appear to overlap (have a non-zero IoU) with the ground-truth box.
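For concreteness, here is a minimal, hedged illustration of what a stricter NMS threshold does to such overlapping boxes, using plain tf.image.non_max_suppression under TF 2.x eager execution (the box coordinates and thresholds are invented for the example). In the Object Detection API itself, the corresponding knob is the iou_threshold inside the batch_non_max_suppression block of the pipeline config.

    import tensorflow as tf

    # Two boxes in [y1, x1, y2, x2] format: the "right" box and a smaller
    # near-duplicate overlapping it (values are made up for illustration).
    boxes = tf.constant([[0.10, 0.10, 0.90, 0.90],
                         [0.15, 0.12, 0.80, 0.85]])
    scores = tf.constant([0.95, 0.40])

    # With a lower iou_threshold, boxes that overlap the top-scoring box too
    # much are suppressed, so only the first box survives.
    keep = tf.image.non_max_suppression(boxes, scores,
                                        max_output_size=10,
                                        iou_threshold=0.5)
    print(keep.numpy())  # -> [0]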
The regular TimeSformer takes 3-channel input images, while I have 4-channel images (RGBD). I am struggling to find a TimeSformer (or a model similar to it) that takes 4-channel input images and extracts features from them.
Does anybody know of such a model? Preferably, I would like to find a pretrained model with weights.
MORE CONTEXT:
I am working with RGBD video frames and have a multiclass classification problem at the end. My videos are fairly long, between 2 and 4 minutes, so classical time-series models don't work for me. My inputs are the RGBD frames/images from the video, and at the end I would like to get a class prediction.
My idea was to divide the problem into 2 stages:
Extract features from the video into a smaller dimension with a TimeSformer-like model. Result: I would get a new data representation (dataset).
Train a classification network on the new data to get a class prediction (a rough sketch of this two-stage setup is shown below).
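To make the plan concrete, here is a rough, non-authoritative sketch of the two stages in PyTorch. backbone is assumed to be any frozen video model that returns one feature vector per clip (the 4-channel adaptation is discussed in the answer below), and all names are placeholders.

    import torch
    import torch.nn as nn

    # Stage 1: cache clip-level features with a frozen video backbone.
    @torch.no_grad()
    def extract_features(backbone, clips):
        """clips: list of tensors shaped (C, T, H, W); returns (N, feat_dim)."""
        backbone.eval()
        return torch.stack([backbone(c.unsqueeze(0)).squeeze(0) for c in clips])

    # Stage 2: train a small classifier on the cached features.
    class ClipClassifier(nn.Module):
        def __init__(self, feat_dim, num_classes):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, 256),
                nn.ReLU(),
                nn.Linear(256, num_classes),
            )

        def forward(self, feats):
            return self.net(feats)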
As of Jan 2023, I don't think there's any readily available TimeSformer model/code that works on 4-channel RGBD images.
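One workaround (a sketch under assumptions, not an official recipe) is to take a pretrained 3-channel model and inflate its patch-embedding layer to 4 channels, keeping the pretrained RGB weights and initialising the extra depth channel with their mean. The module path model.patch_embed.proj is an assumption that matches several ViT/TimeSformer-style implementations; adjust it to whatever checkpoint you actually load.

    import torch

    def inflate_patch_embed_to_4ch(model):
        """Expand a 3-channel patch-embedding conv to accept RGBD input."""
        old_proj = model.patch_embed.proj          # Conv2d/Conv3d with in_channels=3
        w = old_proj.weight.data                   # shape: (out_ch, 3, ...)
        new_proj = type(old_proj)(
            in_channels=4,
            out_channels=old_proj.out_channels,
            kernel_size=old_proj.kernel_size,
            stride=old_proj.stride,
            padding=old_proj.padding,
            bias=old_proj.bias is not None,
        )
        with torch.no_grad():
            new_proj.weight[:, :3] = w                            # reuse RGB kernels
            new_proj.weight[:, 3:] = w.mean(dim=1, keepdim=True)  # init depth channel
            if old_proj.bias is not None:
                new_proj.bias.copy_(old_proj.bias)
        model.patch_embed.proj = new_proj
        return model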
Alternatively, if you are looking for Vision Transformers that can work with depth as well (RGBD data), you can find a list of state-of-the-art approaches and corresponding code (wherever available) here.
One good approach to start with is DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation. You can find the pre-trained models for this approach here.
If you're looking for 3D-CNN-based object detectors that can work on RGBD data, RGB-D Salient Object Detection via 3D Convolutional Neural Networks is a good one to start with. Code and pre-trained models can be found here.
Since I don't fully understand your problem statement or exact requirements, I've proposed a few things that I thought could be helpful.
I am using Mask R-CNN (MRCNN) in Python to train on 20 images (with the annotation info saved as a JSON file) for object detection. The problem is that, in the best case, the loss is around 4, which shows that the model has not learned well (the loss also fluctuates a lot across epochs). Unsurprisingly, when the trained model is used for detection, the result is wrong: it cannot detect the object and instead selects some pixels at random as the object.
Could someone kindly explain how I can improve the performance, and also give some hints about the initial weights when the object is not one of the classes in the COCO dataset?
Thanks in advance.
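Regarding the initial weights: one common pattern (assuming the Matterport Mask R-CNN implementation; the names below follow its samples and may differ in your setup) is to start from the COCO checkpoint but skip the class-dependent head layers, then train only the heads first.

    from mrcnn.config import Config
    import mrcnn.model as modellib

    class MyConfig(Config):
        NAME = "my_object"
        NUM_CLASSES = 1 + 1        # background + your single class
        STEPS_PER_EPOCH = 100

    config = MyConfig()
    model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")

    # Load COCO weights but skip the layers whose shapes depend on the number
    # of classes, since the target class is not in COCO.
    model.load_weights("mask_rcnn_coco.h5", by_name=True,
                       exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                                "mrcnn_bbox", "mrcnn_mask"])

    # dataset_train / dataset_val are placeholders for your mrcnn.utils.Dataset
    # subclasses built from the annotated images.
    model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE,
                epochs=30, layers="heads")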
I have a bunch of images (basically a resized subset of the CelebA dataset), and I have a binary label for each of them. The images are people's faces.
The problem is: I don't know which characteristic those labels correspond to.
Do you have any method to "backtest" features and labels? I have no idea how to determine what those labels correspond to.
I have tried visualizing the images again and again, trying to understand what the similarities between them are, without success.
I have tried an SVM classifier and then plotted its coefficients to see what the classifier was focusing on, without success.
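For reference, the coefficient-plotting attempt looked roughly like this (a minimal sketch; X_img is assumed to be the images as equally sized grayscale arrays of shape (N, H, W) and y the binary labels, both placeholders):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.svm import LinearSVC

    # Flatten the images into feature vectors.
    N, H, W = X_img.shape
    X = X_img.reshape(N, H * W).astype(np.float32) / 255.0

    clf = LinearSVC(C=0.01, max_iter=5000).fit(X, y)

    # Reshape the learned weights back into image space: strongly positive or
    # negative regions hint at which pixel areas drive the decision.
    plt.imshow(clf.coef_.reshape(H, W), cmap="coolwarm")
    plt.colorbar()
    plt.show()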
Thank you for the help
No. From an information theory standpoint, there is no way to reconstruct an arbitrary, abstract concept from a set of observations. This is equivalent to most research using the scientific method: you're trying to guess the classification rationale from a finite set of data. A guess that does a reliable job of predicting observed behaviour is called a "theory".
Does that help frame the way you look at the problem you're facing?
I'm pretty new to object detection. I'm using the TensorFlow Object Detection API with model_main.py to train my model, and I'm now collecting datasets for my project.
I have found and converted two fairly large annotated datasets, one of cars and one of traffic lights, and made two TFRecords from them.
Now I want to fine-tune a pretrained model, but I'm wondering whether it will work. An image such as "001.jpg" from the car dataset will, of course, have annotated bounding boxes for cars, but if it also contains a traffic light, that light won't be annotated. Will this hurt learning? (There can be many of these "problematic" images.) How should I handle this? Is there any workaround? (I really don't want to re-annotate the images.)
If it's a stupid question, I'm sorry, and thanks for any response; links discussing this problem would be best!
Thanks!
The short answer is yes, it might be problematic, but with some effort you can make it work.
If you have two urban datasets, and in one you only have annotations for traffic lights while in the second you only have annotations for cars, then each unannotated car in the first dataset will be learned as a false (background) example, and each unannotated traffic light in the second dataset will be learned as a false example.
The two possible outcomes I can think of are:
The model will not converge, since it tries to learn opposite things.
The model will converge, but will be domain specific. This means that the model will only detect traffic lights on images from the domain of the first dataset, and cars on the second.
In fact, I tried this myself in a different setup and got the latter outcome.
In order to meet your objective of learning both traffic lights and cars no matter which dataset they come from, you'll need to modify your loss function. You need to tell the loss function which dataset each image comes from, and then only compute the loss on the corresponding classes (i.e., zero out the loss for the classes that do not belong to that dataset). Returning to our example, you only compute and backpropagate the traffic-light loss on the first dataset and the car loss on the second.
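To illustrate the idea, here is a schematic PyTorch-style sketch of the masking, not the actual TF Object Detection API loss; all names are placeholders. Each image carries a mask of the classes that are actually annotated in its source dataset, and the classification loss is zeroed out for the others.

    import torch
    import torch.nn.functional as F

    # class 0 = car, class 1 = traffic light
    MASK_CAR_DATASET     = torch.tensor([1.0, 0.0])   # only cars are annotated
    MASK_TRAFFIC_DATASET = torch.tensor([0.0, 1.0])   # only traffic lights are annotated

    def masked_classification_loss(logits, targets, dataset_mask):
        """logits, targets: (num_boxes, num_classes); dataset_mask: (num_classes,)."""
        per_class = F.binary_cross_entropy_with_logits(logits, targets,
                                                       reduction="none")
        per_class = per_class * dataset_mask   # zero out classes not annotated here
        denom = dataset_mask.sum().clamp(min=1.0) * logits.shape[0]
        return per_class.sum() / denom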
For completeness, I will add that if resources are available, the better option is to annotate all classes in all datasets and avoid the suggested modification, since by backpropagating only certain classes you lose the benefit of genuine negative examples for the other classes.
In my project I need to train an object detection model that can recognize hands in different poses in real time from an RGB webcam.
Thus, I'm using the TensorFlow object detection API.
So far I have trained a model based on the ssd_inception_v2 architecture, using the ssd_inception_v2_coco model as the fine-tune checkpoint.
I want to detect 10 different classes of hand poses. Each class has 300 images, which are augmented. In total there are 2,400 images for training and 600 for evaluation. The labeling was done with LabelImg.
The problem is that the model isn't able to detect the different classes properly. Even though it still wasn't good, I got much better results by training on the same images but with only about 3 classes. It seems like the problem is the SSD architecture: I've read several times that SSD networks are not good at detecting small objects.
Finally, my questions are the following:
Could I get better results by using a faster_rcnn_inception architecture for this use case?
Does anyone have advice on how to optimize the model?
Do I need more images?
Thanks for your answers!