I'm using this code to load my net:
net = caffe.Classifier(MODEL_FILE, PRETRAINED,
                       mean=np.load(caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy').mean(1).mean(1),
                       channel_swap=(2,1,0),
                       raw_scale=255,
                       image_dims=(256, 256))
I have doubts about three lines.
1- mean=np.load(caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy').mean(1).mean(1)
What is mean? Should I use this mean value or another one? And if so, where can I get a custom mean value? I'm using a custom dataset.
2- channel_swap=(2,1,0)
What does channel_swap mean? And again, should I use this value or a custom one?
And the last one:
3- raw_scale=255
What is raw_scale? And what value should I use?
I'm using the Cohn-Kanade dataset. All images are 64x64 and grayscale.
The channel_swap is there to reverse RGB into BGR, which is apparently necessary if you use a reference ImageNet model, based on a comment in [1]. In your case the images are grayscale, so you probably do not have three channels. You might need to set it to (0, 0, 0), but even that might not help (I am unsure about the exact implementation of channel_swap). If that does not help, the simplest solution might be to preprocess your data by replicating each grayscale pixel into three equal (R, G, B) values. After that you can drop channel_swap altogether, because your channels have the same value, and swapping them is a no-op.
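If you go the preprocessing route, replicating the single grayscale channel into three identical channels is a one-liner with NumPy (a minimal sketch; `gray` stands in for one of your 64x64 images):

import numpy as np

# `gray` stands for one 64x64 grayscale image already loaded as a NumPy array.
gray = np.zeros((64, 64), dtype=np.float32)           # placeholder for a real image
rgb = np.repeat(gray[:, :, np.newaxis], 3, axis=2)    # shape (64, 64, 3), all channels identical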
Mean is what will be subtracted from your input data to center it. (Remember that neural networks need the data to have zero mean, while input images usually have a positive mean, hence the need for the subtraction.) The mean you subtract should be the same one that was used for training, so using the mean from the file associated with the model is correct. I am not sure, however, whether you should call .mean(1) on it -- did you get that line from some example? If so, it is most likely the correct thing to do.
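Since you train on a custom dataset, the mean should come from your own training images rather than from the ImageNet mean file. A rough sketch with NumPy (the `train_images` array is hypothetical):

import numpy as np

# Hypothetical stack of training images, shape (N, height, width, channels).
train_images = np.zeros((1000, 64, 64, 3), dtype=np.float32)

# One mean value per channel, analogous to what .mean(1).mean(1) does to the mean file.
channel_mean = train_images.mean(axis=(0, 1, 2))      # shape (3,)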
raw_scale is a scale of your input data. The model expects pixels to be normalized, so if your input data has values between 0 and 255, then raw_scale set to 255 is correct. If your data has values between 0 and 1, then raw_scale should be set to 1.
Finally, based on my understanding of the comment in [2], you do not need to provide image_dims.
[1] https://github.com/BVLC/caffe/blob/master/python/caffe/io.py#L204
[2] https://github.com/BVLC/caffe/blob/master/python/caffe/classifier.py#L18
I agree with @Ishamael's comments on channel_swap and mean. I just wanted to add further clarification on raw_scale. Assuming that images are loaded with caffe.io.load_image, values are always in the range of 0 to 1 [1]. Just to note that:
While Python represents images in [0, 1], certain Caffe models like CaffeNet and AlexNet represent images in [0, 255], so the raw_scale of these models must be 255.
And I think it's wise to check the input image values prior to feeding them to the data layer of the network, in order to choose an appropriate raw_scale.
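For example, a quick check of the value range before deciding on raw_scale (the image path is hypothetical):

import caffe

img = caffe.io.load_image('some_image.png')   # returns floats in [0, 1]
print(img.dtype, img.min(), img.max())        # if the range is [0, 1], raw_scale=255 is what CaffeNet-style models expect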
Thank you.
[1] https://github.com/BVLC/caffe/blob/master/python/caffe/io.py#L224
I have a JPG photo of a brain with a shape of (430, 355) and a heatmap image (5x5 shape) relating to the different brain parts.
I want to combine these two in a way that shows which part of the brain is more active.
The easiest solution is to add the second image with a weight, using OpenCV, after resizing it so that it matches the size of the original one:
heatmap = cv2.resize(heatmap, (brain.shape[1], brain.shape[0]))
combined = cv2.addWeighted(brain, 1, heatmap, 0.7, 1)
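For completeness, a self-contained version of the same approach might look like this (file names are hypothetical; the heatmap is assumed to be a single-channel array that first needs the photo's size, dtype and channel count so that addWeighted accepts it):

import cv2
import numpy as np

brain = cv2.imread('brain.jpg')                       # BGR photo, e.g. shape (430, 355, 3)
heatmap = np.load('heatmap.npy').astype(np.float32)   # 5x5 activation map

# Bring the heatmap to 0-255, the photo's size and 3 channels.
heatmap = cv2.normalize(heatmap, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
heatmap = cv2.resize(heatmap, (brain.shape[1], brain.shape[0]))
heatmap = cv2.cvtColor(heatmap, cv2.COLOR_GRAY2BGR)

# Weighted sum: combined = 1.0*brain + 0.7*heatmap + 1
combined = cv2.addWeighted(brain, 1, heatmap, 0.7, 1)
cv2.imwrite('combined.jpg', combined)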
By modifying the parameters you can get the result that best fits your use case.
addWeighted is a function that calculates the weighted sum of two arrays.
Here you can find the documentation: https://docs.opencv.org/3.4/d2/de8/group__core__array.html#gafafb2513349db3bcff51f54ee5592a19
It is not exactly like the one in your question, but it is a fast and effective approach.
I am trying to understand the reason behind the answer to this question. I was expecting the number of parameters to be:
total_params = (filter_height * filter_width + 1) * number_of_filters
BUT you have to multiply the height and width by the number of input channels. Why is this? Isn't there parameter sharing for this dimension? If this is the case, how does this help with feature recognition?
I would expect a CNN to be able to infer relationships between channels, but I haven't seen how this is explicitly done.
Imagine you have an RGB image and want to pass a single filter: number_of_filters = 1.
How would this filter treat each of the input channels: R, G and B?
Should the filter treat all input channels equally? Does the green channel bring the same information as the red?
Well, no, each channel has its own information and the filter must consider all input channels, otherwise it will not be looking at the whole image.
This is exactly the same as with dense/fully connected networks, where you have:
total_params = (input_dim + 1) * units
The only difference is that a convolutional filter has height and width.
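A quick numeric check of that formula, e.g. for 3x3 filters over an RGB input (the values are chosen just for illustration):

# Each filter spans all input channels, plus one bias per filter.
filter_height, filter_width = 3, 3
input_channels = 3
number_of_filters = 32

total_params = (filter_height * filter_width * input_channels + 1) * number_of_filters
print(total_params)   # 896, the same number Keras would report for such a Conv2D layer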
Currently my code succeeds in visualizing the deeper layers of a network via activation maximization. However, in order to get a more interpretable image, I'm experimenting with different regularization methods; right now I'm regularizing via a Gaussian convolution. See Understanding Neural Networks Through Deep Visualization by Yosinski et al.
To do this I've added a Gaussian loss to my loss function. I'm using Python and TensorFlow. The Gaussian loss is calculated by subtracting a blurred version of the image from the current image at each iteration, thereby steering the network towards producing a blurrier final image.
First, a Gaussian kernel of size 4x4 is created.
Then I convolve each color channel with this kernel through tf.nn.conv2d, with the following code (gauss_var is the Gaussian kernel with shape [4, 4, 1, 1]):
# unstack 3 channel image
[tR, tG, tB] = tf.unstack(input_image, num=3, axis=3)
# give each one a fourth dimension in order to use it in conv2d
tR = tf.expand_dims(tR, 3)
tG = tf.expand_dims(tG, 3)
tB = tf.expand_dims(tB, 3)
#convolve each input image with the gaussian filter
tR_gauss = tf.nn.conv2d(tR, gauss_var, strides=[1, 1, 1, 1], padding='SAME')
tG_gauss = tf.nn.conv2d(tG, gauss_var, strides=[1, 1, 1, 1], padding='SAME')
tB_gauss = tf.nn.conv2d(tB, gauss_var, strides=[1, 1, 1, 1], padding='SAME')
I calculate the difference by doing:
# calculate difference
R_diff = tf.subtract(tR, tR_gauss)
G_diff = tf.subtract(tG, tG_gauss)
B_diff = tf.subtract(tB, tB_gauss)
And make it into one number:
total_diff = tf.add_n([R_diff, G_diff, B_diff])
gaussian_loss = tf.reduce_sum(total_diff)
The problem is that the resulting image always shows bars at the borders and is colored bluish (the example image I attached exaggerates this effect).
I'm pretty sure this bordering effect has something to do with conv2d, but I don't know how to change it. So far I've tried using different kernel sizes, and although the borders change, they still remain. Changing padding from 'SAME' to 'VALID' results in different output dimensions which is also problematic. Any ideas on how to solve this?
Thanks in advance!
Cheers,
I had a similar problem with ugly borders around my output image.
I found out that the padding='SAME' option of conv2d adds zeros to the outside of the image.
In my case the problem was that my images had a white background (color value 255), so the zero padding effectively added a black border, which produces a large color gradient.
Maybe this thought can help others, even if it's almost a year later...
Wow, I'm sorry no one answered your question. I'm no expert, but I'll give it a try. First, there is no perfect solution.
The easiest way is to look at tf.pad. It has a couple of modes that allow you to grow your image by copying the edges. Otherwise, you may need to resort to tf.tile. The padding amount should match half your kernel width. Then use 'VALID' padding for the convolution, and the output will be the same size as your original (unpadded) input.
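A minimal sketch of that idea, reusing tR and gauss_var from your question (with a 4x4 kernel the total padding per spatial dimension has to be kernel_size - 1 = 3, so it is split asymmetrically here):

# Mirror the edges instead of relying on the implicit zero padding of 'SAME'.
tR_padded = tf.pad(tR, [[0, 0], [1, 2], [1, 2], [0, 0]], mode='REFLECT')
tR_gauss = tf.nn.conv2d(tR_padded, gauss_var, strides=[1, 1, 1, 1], padding='VALID')
# Output height/width: (H + 3) - 4 + 1 = H, i.e. the same size as the unpadded input.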
For a better, yet more difficult solution, you can work in the frequency domain. Convert the images to frequency domain. Then zero out some of the higher frequencies and then return to the space domain.
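A rough NumPy sketch of the frequency-domain route for a single channel (the cutoff radius is arbitrary):

import numpy as np

img = np.random.rand(64, 64)               # stand-in for one color channel
f = np.fft.fftshift(np.fft.fft2(img))      # shift low frequencies to the center
mask = np.zeros_like(f)
c, r = 32, 8                               # keep only a band of low frequencies
mask[c - r:c + r, c - r:c + r] = 1
blurred = np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))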
Hopefully you have already solved your problem.
For future readers, this link points to a small project that uses tf.nn.depthwise_conv2d to avoid the problem with the edges.
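Independently of that project, one way to combine the two ideas above (reflection padding plus a per-channel convolution) is tf.nn.depthwise_conv2d, which blurs all three channels in one call instead of unstacking them; a sketch reusing input_image and gauss_var from the question:

# Mirror-pad, then tile the single-channel Gaussian kernel so it is applied to each of the 3 channels.
padded = tf.pad(input_image, [[0, 0], [1, 2], [1, 2], [0, 0]], mode='REFLECT')
gauss_3ch = tf.tile(gauss_var, [1, 1, 3, 1])    # shape [4, 4, 3, 1]
blurred = tf.nn.depthwise_conv2d(padded, gauss_3ch,
                                 strides=[1, 1, 1, 1], padding='VALID')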
I'm implementing a CNN with Theano. According to the paper, I have to do this image preprocessing before training the CNN:
We extracted RGB patches of 61x61 dimensions associated with each poselet activation, subtracted the mean and used this data to train the convnet model shown in Table 1
Can you tell me what "subtracted the mean" means? Tell me if these steps are correct (this is what I understood):
1) Compute the mean of the red channel, the green channel and the blue channel over the whole image
2) For each pixel, subtract the mean of the red channel from its red value, the mean of the green channel from its green value, and likewise for the blue channel
3) Is it correct to have negative values, or do I have to use the absolute value?
Thanks all!!
You should read the paper carefully, but most probably they mean the mean of the patches: you have N matrices of 61x61 pixels, which is equivalent to N vectors of length 61^2 (or 3*61^2 if there are three channels). They simply compute the mean of each dimension, i.e. the mean over these N vectors with respect to each of the 3*61^2 dimensions. As a result they obtain a mean vector of length 3*61^2 (or a mean matrix / mean patch, if you prefer), and they subtract it from all of these N patches. The resulting patches will have negative values; that is perfectly fine, you should not take the absolute value, neural networks prefer this kind of data.
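A small NumPy sketch of that mean-patch subtraction (the number of patches here is made up):

import numpy as np

# N RGB patches of 61x61, stored e.g. as (N, 3, 61, 61).
X = np.random.randint(0, 256, size=(1000, 3, 61, 61)).astype(np.float32)

mean_patch = X.mean(axis=0)    # shape (3, 61, 61): one mean per dimension
X_centered = X - mean_patch    # negative values are expected and fine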
I would assume the mean mentioned in the paper is the mean over all images used in the training set (computed separately for each channel).
Several indications:
Caffe is a library for ConvNets. In their tutorial they mention a "compute image mean" step: http://caffe.berkeleyvision.org/gathered/examples/imagenet.html
For this they use the following script: https://github.com/BVLC/caffe/blob/master/examples/imagenet/make_imagenet_mean.sh
which does what I described.
Google played around with ConvNets and published their code here: https://github.com/google/deepdream/blob/master/dream.ipynb and they also use the mean of the training set.
This is of course only indirect evidence, since I cannot explain why this is done. In fact, I stumbled over this question while trying to figure out precisely that.
//EDIT:
In the meantime I found a source confirming my claim:
There are three common forms of data preprocessing a data matrix X [...]
Mean subtraction is the most common form of preprocessing. It
involves subtracting the mean across every individual feature in the
data, and has the geometric interpretation of centering the cloud of
data around the origin along every dimension. In numpy, this operation
would be implemented as: X -= np.mean(X, axis = 0). With images
specifically, for convenience it can be common to subtract a single
value from all pixels (e.g. X -= np.mean(X)), or to do so separately
across the three color channels.
As we can see, the whole data is used to compute the mean.
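In code, the per-channel variant over the whole training set would look roughly like this (the shapes are hypothetical):

import numpy as np

# Hypothetical training set: N RGB patches of 61x61.
X = np.random.randint(0, 256, size=(5000, 61, 61, 3)).astype(np.float32)

channel_mean = X.mean(axis=(0, 1, 2))    # one scalar per R, G, B channel
X -= channel_mean                        # center the whole dataset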
I want to train an SVM for object detection. At this point I have a Python script which detects FAST keypoints and extracts BRIEF features at those locations.
Now I don't know how to use these descriptors to train an SVM.
Would you please tell me:
How to use the descriptors to train the SVM (as far as I know these descriptors should be my training data)?
What are labels used for, and how can I get them?
To train an SVM you need a matrix X with your features and a vector y with your labels. For 3 images and two features it should look like this:
>>> from sklearn import svm
>>> X = [[0, 0],   # image 1: negative -> label 0
...      [1, 3],   # image 2: positive -> label 1
...      [2, 5]]   # image 3: negative -> label 0
>>> y = [0, 1, 0]
>>> model = svm.SVC()
>>> model.fit(X, y)
The training set would consist of several images; each image is one row of X with a corresponding entry in y.
Labels:
For the labels y you need positive and negative examples (0 or 1):
Positive Samples
You can specify positive samples in two ways. One way is to specify
rectangular regions in a larger image. The regions contain the objects
of interest. The other approach is to crop out the object of interest
from the image and save it as a separate image. Then, you can specify
the region to be the entire image. You can also generate more positive
samples from existing ones by adding rotation or noise, or by varying
brightness or contrast.
Negative Samples
Images that do not contain objects of interest.
[slightly edited from here]
Feature matrix X:
Here you can get creative, but I will mention a simple idea: make height * width features, one for each pixel of each image, but set them all to 0 except in a small region around the FAST keypoints. In the end your X matrix will have dimension (n_images, height*width).
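A sketch of that simple idea for one image (the file name and the detector call are assumptions; `gray` is the grayscale image):

import cv2
import numpy as np

gray = cv2.imread('sample.png', cv2.IMREAD_GRAYSCALE)     # hypothetical training image
keypoints = cv2.FastFeatureDetector_create().detect(gray)

height, width = gray.shape
features = np.zeros((height, width), dtype=np.float32)
for kp in keypoints:
    x, y = int(kp.pt[0]), int(kp.pt[1])
    features[max(0, y - 2):y + 3, max(0, x - 2):x + 3] = 1  # mark a small region around each keypoint

row = features.ravel()   # one row of X, length height*width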
Another commonly used idea is Bag of Words. The X matrix must have a fixed number of features/columns, while the number of keypoints per image is variable. This is a representation problem, but it can be solved by binning the keypoint descriptors into a histogram with a fixed number of bins. For details see for example this paper.
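A possible sketch of the Bag of Words idea with scikit-learn (the cluster count and the fake descriptors are placeholders; in practice you would stack the real BRIEF descriptors of your training images):

import numpy as np
from sklearn.cluster import KMeans

# One array of BRIEF descriptors (n_keypoints_i x 32 bytes) per training image.
descriptors_per_image = [np.random.randint(0, 256, size=(np.random.randint(20, 60), 32))
                         for _ in range(10)]

n_words = 50   # number of "visual words" = histogram bins
kmeans = KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(descriptors_per_image))

# Each image becomes a fixed-length histogram of visual-word counts: one row of X.
X = np.array([np.bincount(kmeans.predict(d), minlength=n_words)
              for d in descriptors_per_image])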
You will have to consult the specialized literature to come up with more ways to incorporate the BRIEF features but I hope this will give you an idea on how to get started.