For my project, I have to feed an image, which is an apparent resistivity model, into a convolutional neural network and output a corresponding "true" model image. The idea is like this: apparent resistivity model -> true model
Both sets of images are saved as tiff files in different folders. From what I understand I need to convert these to floating-point tensors to feed them into the CNN. However I'm confused as to the overall big picture - how do I feed both the apparent resistivity model and the true model into the CNN for training? Can I simply create a function that will extract the images from their directories (it is very important this is done in the correct corresponding order), convert them to arrays, normalise them, and then store them in lists? And then pass both sets of lists into the CNN as X_train and y_train?
After training, I will feed more apparent resistivity pictures into the CNN to see the image output and compare it to the "real" true model, in order to check the performance of the model. How will I get the CNN to output an image?
Sorry if these are questions are too vague, large or basic. The tutorials I've seen online all have to do with training a neural net for classification, which is not helpful to me. Thanks in advance.
Related
Regular TimesFormer takes 3 channel input images, while I have 4 channel images (RGBD). I am struggling to find a TimesFormer (or a model similar to TimesFormer) that takes 4 channel input images and extract features from them.
Does anybody know such a model? Preferably, I would like to find pretrained model with weights.
MORE CONTEXT:
I am working with RGBD video frames and have multiclass classification problem at the end. My videos are fairly large, between 2 to 4 minutes so classical time-series models doesn't work for me. So my inputs are RGBD frames/images from the video and at the end I would like to get class prediction.
My idea was to divide the problem into 2 stages:
Extract features from video into smaller dimension with TimesFormer-like model. Result: I would get a new data representation (dataset).
Train clasification ML network with new data to get a class prediction.
As of Jan 2023, I don't think there's any readily available TimeSformer model/code that works on RGBD 4 channel image.
Alternatively, If you are looking for Vision Transformers that can work with depth as well (RGBD data), you have the entire list with state-of-the-art approaches and corresponding code (wherever available) here.
One of the good approach to start with is DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation. You can find the pre-trained models with this approach here.
If you're looking for 3D CNN based object detectors that can work on RGBD data: RGB-D Salient Object Detection via 3D Convolutional Neural Networks is one of the good ones to start with. Code and pre-trained models can be found here.
Since I don't fully understand your problem statement or exact requirement I'm proposed few things that I thought could be helpful.
Currently, I'm working with a dataset where I have two kinds of images: "sharp version" of the image and "blurry version" of the same images, where a blur was added synthetically. My goal is to train a model that takes the blurry version of the images in and tries to deblur the image as much as it can so that the "deblurred image" is closer to the sharp version. In the literature, the UNet architecture seemed to be a model with good results. Additionally, I can use a pre-trained U-Net via Pytorch (https://pytorch.org/hub/mateuszbuda_brain-segmentation-pytorch_unet/).
My problem is now: When I train this pre-trained U-Net with my images and then try it on my test set, I get the following output:
The original image:
I know that this pre-trained model is usually used for biomedical image segmentation but I'm rather confused about how I have to modify the model to use it for an Image Deblurring/Reconstruction task. Does anyone have any advice on how to do this?
I would appreciate any feedback :)
The U-net you're using is for segmentation (classification of each pixels of the image) whereas you're trying to denoise the image (getting your image "sharper"/remove noise). It explains the results you got.
To get what you want you need and as DerekG said, you first need to modify the number of channels of the output. By modify it, you can't load all the pretrained model. You will have to copy parameters by parameters until the last one.
As the last layer is initialized randomly, you can retrained the model with your training set. You can freeze or not the pretrained parts.
Also, I'm not sure what your new dataset is but if it's really not related to biomedical images you should retrain your network from scratch (transfer learning shouldn't be done in these cases), maybe even change the encoder-decoder network.
From the included link:
import torch
model = torch.hub.load('mateuszbuda/brain-segmentation-pytorch', 'unet',
in_channels=3, out_channels=1, init_features=32, pretrained=True)
The model you define has a single output channel, resulting in a grayscale image output. You need 3 output channels for an RGB image.
Where can I find details to implement siamese networks to perform image similarity and to retrieve the most similar image from a dataset
It is difficult to get a large number of image data for all the classes, so only a few images, eg 10 images for some classes, are available for most of the classes.
SIFT or ORB seems to perform poorly on some classes.
My project is to differentiate between the license plates based on the states of the UAE. Here I upload few example images.
When there is few training data, no matter how annoying it sounds, the best approach is usually to collect more. Deep networks are infamously data hungry and their performance is poor when the data is scarce. This said, there are approaches that might help you:
Transfer learning
Data augmentation
In transfer learning, you take an already trained deep net (e.g. ResNet50), which was trained for some other task (e.g. ImageNet), fix all its network weights except for the weights in the last few layers and train on your task of interest.
Data augmentation slightly modifies your training data in some predictable way. In your case you can rotate your image by a small angle, apply a perspective transformation, scale the image intensities or slightly change the colors. You apply a different set of these operations with different parameters every time you want to use a particular training image. This way you generate new training examples enlarging your training set.
I am trying to extract features from the last layer of a GoogleNet caffe model fine-tuned on car classification. Here's the deploy.prototxt. I tried a couple of things:
I took features from 'loss3_classifier_model' layer which is incorrect.
Now I am extracting features from 'pool5' layer in the model as given in prototxt.
I am not sure whether it's correct or not because the features I am extracting for different cars doesn't seem to have much difference. In other words, I am unable to differentiate cars using this last layer features, I used Euclidean distance on features (Is it correct?). I am not using using softmax as I don't want to classify them, I just want features and I rechecking them using euclidean distance.
These are the steps I followed:
## load the model
net = caffe.Net('deploy.prototxt',
caffe.TEST,
weights ='googlenet_finetune_web_car_iter_10000.caffemodel')
# resize the input size as I have only one image in my batch.
net.blobs["data"].reshape(1, 3, 224, 224)
# I read my image of size (x,y,3)
frame = cv2.imread(frame_path)
bbox = frame[int(x1):int(x2), int(y1):int(y2)] # getting the car, # I have stored x1,x2,x3,x4 seperatly.
# resized my image to 224,224,3, network input size.
bbox = cv2.resize(bbox, (224, 224))
# to align my input to the input of the model
bbox_input = bbox.swapaxes(1,2).reshape(3,224,224)
# fed input image to the model.
net.blobs['data'].data[0] = bbox_input
net.forward()
# features from pool5 layer or the last layer.
temp = net.blobs["pool5"].data[0]
Now, I want to confirm if these steps are correct or not? I am new to caffe and I am not sure about the steps I wrote above.
Both options are valid. The farther you are from the end of your network, the less specialized the features will be to your problem/training set, while still capturing relevant information that may be applied to similar tasks. As you move to the end of the network, the features will be more tuned to your task.
Note that you are dealing with two similar problems/tasks. The network was fine-tuned for car classification ("which model is this car?") and now you want to verify if two cars belong to the same model.
Considering the network was fine-tuned with a large and representative training set, the features obtained from it are powerful and with a lot of representation capability (i.e., they capture a lot of complex underlying patterns of the task they were trained to) that are useful to your verification task.
With this in mind, you could try many ways of comparing two feature vectors:
Euclidean Distance is too simple. I would try it only because it is easy/fast to implement;
Cosine Similarity [1] might also be a simple, but good starting point;
Classifier. Another possibility that I've have done in a similar problem was to train a classifier (SVM, Logistic Regression) on top of a combination of the two features. The input of your classifier could be the concatenation of them side by side.
Incorporate the verification task to your network. You could alter the GoogleNet architecture to receive two photos of cars and output if they belong or not to the same model. You would be transforming/fine-tuning your network from the classification problem to a verification task. Check for siamese networks [2].
Edit: there is a mistake when you resize your frame that may be the cause of your problems!
# I read my image of size (x,y,3)
frame = cv2.imread(frame_path)
# resized my image to 224,224,3, network input size.
bbox = cv2.resize(frame, (224, 224))
You should have passed frame as input in the cv2.resize() method. You are probably feeding a garbage input to the network and that is why the output ends up always looking similar.
The purpose of the task is to classify images by means of SVM. The variable 'images' is supposed to contain the image information and correspondingly labels contains image labels. How can I build (what format and dimensions) should the images and labels have? I tried unsuccesfully images to be a Python array (appending flattened images) and then, in another attempt, Numpy arrays:
images=np.zeros((number_of_images, image_size))
labels=np.zeros((number_of_images, 1))
svm=cv2.SVM()
svm.train(images, labels)
Is it a right approach to the problem and if so, what is the correct way for training the classifier?
I don't think that you can use raw image data to train SVM model. Ok-ok, you can, but it won't be very fruitful.
The basic approach is to extract some features from each image and to use these features for training your model. A set of features forms a dictionary of words, each of which describes your image. Due to the fact that you are using the same set of words to describe each image, you can compare features corresponding to different images. This link introduces more details, check it.
Whats next?
Choose a feature extractor for your algo - HOG, SURF, SIFT (link)
Extract features from each image. You'll get an array of the same length as images array.
Initialize bag-of-words (BoG) model
Train SVM with BoG
Useful links:
C++ vey detailed example
Documentation for existing BOG classifier