While implementing batch normalization in Python from scratch, I got confused. Please see the following.
A paper presents some figures about normalization methods that I think may not be correct. Both the description and the figure look wrong to me.
Description from the paper:
Figure from the paper:
In my opinion, the representation of batch normalization in the original paper is not correct. I'm posting the issue here for discussion.
I think batch normalization should look like the following figure.
The key point is how to calculate mean and std.
With feature maps of shape (batch_size, channel_number, width, height),
mean = X.mean(axis=(0, 2, 3), keepdims=True)
or
mean = X.mean(axis=(0, 1), keepdims=True)
Which one is correct?
You should calculate the mean and std across all pixels in all images of the batch, separately for each channel. So use axis=(0, 2, 3).
If the channels have roughly the same distributions, you may calculate the mean and std across channels as well; in that case just use mean() and std() without the axis parameter.
The figure in the article is correct: it takes the mean and std across H and W (the image dimensions) for each batch. Obviously, the channel dimension is not shown in the 3D cube.
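To make the axis choice concrete, here is a minimal NumPy sketch of a batch-norm forward pass in training mode (gamma and beta are hypothetical learnable scale/shift parameters; eps is the usual numerical-stability constant):

import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    # X: (batch_size, channels, height, width)
    # Statistics are computed per channel, over the batch and spatial dims.
    mean = X.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    var = X.var(axis=(0, 2, 3), keepdims=True)     # shape (1, C, 1, 1)
    X_hat = (X - mean) / np.sqrt(var + eps)        # normalized activations
    return gamma * X_hat + beta                    # scale and shift

# Usage sketch with random data
X = np.random.randn(8, 3, 32, 32)
gamma = np.ones((1, 3, 1, 1))
beta = np.zeros((1, 3, 1, 1))
out = batch_norm_forward(X, gamma, beta)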
Related
In this page (https://pytorch.org/vision/stable/models.html), it says that "All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded in to a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]".
Shouldn't the usual mean and std for normalization be [0.5, 0.5, 0.5] and [0.5, 0.5, 0.5]? Why are such strange values used?
Using the mean and std of ImageNet is a common practice. They are calculated based on millions of images. If you want to train from scratch on your own dataset, you can calculate the new mean and std. Otherwise, using the ImageNet pretrained model with its own mean and std is recommended.
In that example, they are using the mean and stddev of ImageNet, but if you look at their MNIST examples, the mean and stddev are 1-dimensional (since the inputs are greyscale-- no RGB channels).
Whether or not to use ImageNet's mean and stddev depends on your data. Assuming your data are ordinary photos of "natural scenes"† (people, buildings, animals, varied lighting/angles/backgrounds, etc.), and assuming your dataset is biased in the same way ImageNet is (in terms of class balance), then it's ok to normalize with ImageNet's scene statistics. If the photos are "special" somehow (color filtered, contrast adjusted, uncommon lighting, etc.) or an "un-natural subject" (medical images, satellite imagery, hand drawings, etc.) then I would recommend correctly normalizing your dataset before model training!*
Here's some sample code to get you started:
import os
import torch
from torchvision import datasets, transforms
from torch.utils.data.dataset import Dataset
from tqdm.notebook import tqdm
from time import time

N_CHANNELS = 1

dataset = datasets.MNIST("data", download=True,
                         train=True, transform=transforms.ToTensor())
full_loader = torch.utils.data.DataLoader(dataset, shuffle=False, num_workers=os.cpu_count())

before = time()
mean = torch.zeros(1)
std = torch.zeros(1)
print('==> Computing mean and std..')
# With the default batch_size of 1, the loop runs once per image, so dividing
# the accumulated per-image statistics by len(dataset) averages over images.
for inputs, _labels in tqdm(full_loader):
    for i in range(N_CHANNELS):
        mean[i] += inputs[:, i, :, :].mean()
        std[i] += inputs[:, i, :, :].std()
mean.div_(len(dataset))
std.div_(len(dataset))
print(mean, std)
print("time elapsed: ", time() - before)
† In computer vision, "Natural scene" has a specific meaning which isn't related to nature vs man-made, see https://en.wikipedia.org/wiki/Natural_scene_perception
* Otherwise you run into optimization problems due to elongations in the loss function-- see my answer here.
I wasn't able to calculate the standard deviation as planned, but did it using the code below. The mean and standard deviation of the grayscale ImageNet train dataset are (round them as much as you like):
Mean: 0.44531356896770125
Standard Deviation: 0.2692461874154524
import os
import multiprocessing
import numpy as np
from imageio import imread  # assumed; the original import is not shown, any imread returning an array works

def calcSTD(d):
    meanValue = 0.44531356896770125
    squaredError = 0
    numberOfPixels = 0
    for f in os.listdir("/home/imagenet/ILSVRC/Data/CLS-LOC/train/"+str(d)+"/"):
        if f.endswith(".JPEG"):
            image = imread("/home/imagenet/ILSVRC/Data/CLS-LOC/train/"+str(d)+"/"+str(f))
            ### Transform to gray if not already gray anyway
            if np.array(image).ndim == 3:
                matrix = np.array(image)
                blue = matrix[:,:,0]/255
                green = matrix[:,:,1]/255
                red = matrix[:,:,2]/255
                gray = (0.2989 * red + 0.587 * green + 0.114 * blue)
            else:
                gray = np.array(image)/255
            ###----------------------------------------------------
            # Accumulate the squared error against the precomputed global mean
            for line in gray:
                for pixel in line:
                    squaredError += (pixel-meanValue)**2
                    numberOfPixels += 1
    return (squaredError, numberOfPixels)

a_pool = multiprocessing.Pool()
folders = []
[folders.append(f.name) for f in os.scandir("/home/imagenet/ILSVRC/Data/CLS-LOC/train") if f.is_dir()]
resultStD = a_pool.map(calcSTD, folders)
StD = (sum([intensity[0] for intensity in resultStD])/sum([pixels[1] for pixels in resultStD]))**0.5
print(StD)
Source: https://stackoverflow.com/a/65717887/7156266
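If you want to plug these grayscale statistics into a torchvision pipeline, a minimal sketch (the values are the ones reported above, rounded; the transform names are standard torchvision):

from torchvision import transforms

# Normalize single-channel (grayscale) images with the statistics reported above.
gray_imagenet_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),                              # scales pixels to [0, 1]
    transforms.Normalize(mean=[0.4453], std=[0.2692]),  # rounded grayscale ImageNet stats
])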
TL;DR
I believe the reason is, like many things in (deep) machine learning, it just happens to work well.
Details
The word 'normalization' in statistics can refer to different transformations.
For example:
for all x in X: x -> (x - min(X)) / (max(X) - min(X))
will normalize and stretch the values of X to the [0, 1] range.
Another example:
for all x in X: x -> (x - mean(X)) / stdv(X)
will transform the image to have mean = 0 and standard deviation = 1. This transformation is called the standard score, or sometimes standardization. If we multiply the result by sigma and add mu, we set the result to have mean = mu and stdv = sigma.
PyTorch doesn't do exactly either of these. It applies the standard-score formula, but not with the mean and stdv of X itself (the image to be normalized); it uses values that are the average mean and average stdv over a large set of ImageNet images. It does not, however, set the mean and stdv of the result to these values.
If the image happens to have the same mean and standard deviation as the average of the ImageNet set, it will be transformed to have mean 0 and stdv 1.
Otherwise, it will be transformed into something that is a function of its own mean and stdv and of those averages.
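As a concrete illustration of the above, here is a minimal sketch comparing the three transformations on a single image tensor; the ImageNet constants are the ones from the torchvision documentation quoted earlier:

import torch

img = torch.rand(3, 224, 224)  # a 3-channel image with values in [0, 1]

# Min-max normalization: stretch to [0, 1] using the image's own min/max
minmax = (img - img.min()) / (img.max() - img.min())

# Standard score: zero mean, unit std, using the image's own statistics
standardized = (img - img.mean()) / img.std()

# What torchvision's transforms.Normalize does: the standard-score formula,
# but with fixed per-channel ImageNet averages instead of the image's own stats
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
imagenet_normalized = (img - mean) / std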
To me it is not clear what this rigorously means (why the average of the stdvs? and why apply the standard-score formula with those averages?).
Perhaps someone can clarify this?
However, like many things in deep machine learning, the theory is not fully established yet. My guess is that people have tried different normalizations and this one just happens to perform well.
That does not mean it is the best possible normalization, only that it is a decent one. And of course, if you are using pre-trained weights that were learned with this specific normalization, you are probably better off using the same normalization for inference or for a derived model as was used during training.
I am trying to port some Lua/Torch code to Python. There's a sequence that runs a Gaussian blur over an image as follows:
local gauK = image.gaussian(math.ceil(3*sigma)*2+1, sigma)
inp = image.convolve(inp, gauK, 'same')
To replicate this, I have been looking at cv2.GaussianBlur() and cv2.filter2D with a Gaussian kernel passed in.
Method 1 (cv2.GaussianBlur):
kernel_size = int(math.ceil(3 * sigma) * 2 + 1) # same size as the lua code
blurred_image = cv2.GaussianBlur(img, ksize=(kernel_size, kernel_size), sigmaX=sigma)
Method 2 (cv2.filter2D):
kernel_size = int(math.ceil(3 * sigma) * 2 + 1) # same size as the lua code
gaussian_kernel = cv2.getGaussianKernel(kernel_size, sigma)
blurred_image_2 = cv2.filter2D(img, -1, gaussian_kernel)
Between Method 1 and Method 2, I get different images. It appears that the Method 1 image is a bit more blurred than the Method 2 image. Is there any reason why I might be getting different results here? I am trying to figure out which one would match the Lua code. Thanks.
This is a weird one; for practicality's sake, I'd suggest you just pick one you're happy with and use it. That aside, I'm guessing that the semantics of how the multiple arguments are handled cause this mismatch. Also, OpenCV's equation for inferring sigma from kernel size and vice versa does not appear to match yours.
From GaussianBlur docs:
ksize: Gaussian kernel size. ksize.width and ksize.height can differ but they both must be positive and odd. Or, they can be zero's and then they are computed from sigma.
sigmaX: Gaussian kernel standard deviation in X direction.
sigmaY: Gaussian kernel standard deviation in Y direction; if sigmaY is zero, it is set to be equal to sigmaX, if both sigmas are zeros, they are computed from ksize.width and ksize.height, respectively (see cv::getGaussianKernel for details); to fully control the result regardless of possible future modifications of all this semantics, it is recommended to specify all of ksize, sigmaX, and sigmaY.
And the getGaussianKernel docs:
If it is non-positive, it is computed from ksize as sigma = 0.3*((ksize-1)*0.5 - 1) + 0.8
All emphasis mine. Is there any chance your sigma is negative? That might cause a mismatch.
EDIT: just noticed you want this to match the lua code. My advice would be to save out the results, then compare them in photoshop or your favorite image editor. If you subtract the test from the reference, you should be able to see the difference, and as your attempts get closer, there should be less difference overall. Barring that, you could try and read the source to figure out the difference in definition, or write your own!
Good luck!
The cv2.getGaussianKernel function returns a 1D vector; to make it a 2D Gaussian matrix, you can multiply it by its transpose (@ is used for matrix multiplication). Could you try this code:
gaussian_kernel = cv2.getGaussianKernel(kernel_size, sigma)
kernel_2D = gaussian_kernel @ gaussian_kernel.transpose()
blurred_image_2 = cv2.filter2D(img, -1, kernel_2D)
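For what it's worth, here is a self-contained sketch you could run to check whether the two methods now agree; "input.png" and sigma = 2.0 are hypothetical test values, and exact equality is not guaranteed because of rounding:

import math
import cv2
import numpy as np

img = cv2.imread("input.png")          # hypothetical test image path
sigma = 2.0
kernel_size = int(math.ceil(3 * sigma) * 2 + 1)

# Method 1: built-in Gaussian blur
blurred_1 = cv2.GaussianBlur(img, (kernel_size, kernel_size), sigma)

# Method 2: explicit 2D kernel built from the 1D one, then filter2D
g = cv2.getGaussianKernel(kernel_size, sigma)
blurred_2 = cv2.filter2D(img, -1, g @ g.transpose())

# With the 2D kernel the two results should be (nearly) identical,
# up to border handling and integer rounding.
print(np.abs(blurred_1.astype(np.int32) - blurred_2.astype(np.int32)).max())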
I have two arrays of size (n, m, m) (n images of size (m, m) each). I want to perform a cross-correlation between each corresponding pair of images in the two arrays.
Example: n=1 -> corr2d([m,m]_1, [m,m]_2)
My current approach uses a bunch of for loops in Python:
for i in range(len(X)):
    X_co = X[i, 0, :, :] / np.max(X[i, 0, :, :])
    X_x = X[i, 1, :, :] / np.max(X[i, 1, :, :])
    autocorr[i, 0, :, :] = correlate2d(X_co, X_x, mode='same', boundary='fill', fillvalue=0)
Obviously this is very slow when the input contains many images, and it becomes a substantial part of the total run time if (m, m) << n.
The obvious optimization is to skip the loop and feed everything directly to the compiled correlation function. Currently I'm using scipy's correlate2d.
I've looked around but haven't found any function that allows correlation along some axis or multiple inputs.
Any tips on how to make scipy's correlate2d work or alternatives?
I decided to implement it via the FFT instead.
import numpy as np

def fft_xcorr2D(x):
    # FFT over axes (-2, -1) (the default in fft2)
    # Pad because of the cyclic (circular) behavior of the FFT
    x = np.fft.fft2(np.pad(x, ([0, 0], [0, 0], [0, 34], [0, 34]), mode='constant'))
    # Conjugate the second band for correlation, not convolution (convolution theorem)
    x[:, 1, :, :] = np.conj(x[:, 1, :, :])
    # Multiply elementwise over the 2nd axis (the 2 image bands),
    # inverse-transform (ifft2 also defaults to axes (-2, -1)),
    # then fftshift over the rows and columns of each image
    corr = np.fft.fftshift(np.fft.ifft2(np.prod(x, axis=1)), axes=(-2, -1))
    # Return after removing the padding
    return np.abs(corr)[:, 3:-2, 3:-2]
Call via:
ts=fft_xcorr2D(X)
If anybody wants to use it:
My input is a 4D array: (N, 2, #Rows, #Cols)
E.g. (500, 2, 30, 30): 500 images, 2 bands (polarizations, for example), of 30x30 pixels
If your input is different, adjust the padding to your liking
Check that your input order is the same as mine; otherwise change the axes arguments in the fft2 and ifft2 functions, in np.prod, and in fftshift. I use fftshift to get the maximum value in the middle (otherwise it ends up in the corners), so be wary of that if that's not what you want.
Why is it the maximum value? Technically, it doesn't have to be, but for my purpose it is. fftshift is used to get a correlation that looks like you're used to. Otherwise, the quadrants are turned "inside out". If you wonder what I mean, remove fftshift (just the fftshift part, not its arguments), call the function as before, and plot it.
Afterwards, it should be ready to use.
Possibly x.prod(axis=1) is faster than np.prod(x, axis=1), but this is an old post; it showed no improvement when I tried it.
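If you would rather stay within SciPy, recent versions of scipy.signal.fftconvolve accept an axes argument, which lets you batch this over the last two axes. A sketch under that assumption (for real inputs, correlation equals convolution with the second image flipped, so this should match correlate2d with zero-filled boundaries up to floating-point error):

import numpy as np
from scipy.signal import fftconvolve

# X: (N, 2, m, m) real-valued, as in the question; random data for illustration
X = np.random.rand(500, 2, 30, 30)

# Normalize each band by its per-image maximum, as in the original loop
X_co = X[:, 0] / X[:, 0].max(axis=(-2, -1), keepdims=True)
X_x = X[:, 1] / X[:, 1].max(axis=(-2, -1), keepdims=True)

# Cross-correlation = convolution with the second input flipped (real inputs),
# computed for all N image pairs at once over the last two axes.
corr = fftconvolve(X_co, X_x[:, ::-1, ::-1], mode='same', axes=(-2, -1))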
Currently my code succeeds in visualizing the deeper layers of a network via activation maximization. However, in order to get a more interpretable image, I'm experimenting with different regularization methods; currently I'm regularizing via a Gaussian convolution. See Understanding Neural Networks Through Deep Visualization by Yosinski et al.
To do this I've added a Gaussian loss to my loss function. I'm using Python and TensorFlow. The Gaussian loss is calculated by (each iteration) subtracting a blurred image from the current image, thereby steering the network towards producing a more blurry final image.
First, a Gaussian kernel of size 4x4 is created.
Then I convolve each color channel with this kernel through tf.nn.conv2d, using the following code (gauss_var is the Gaussian kernel with shape [4, 4, 1, 1]):
# unstack 3 channel image
[tR, tG, tB] = tf.unstack(input_image, num=3, axis=3)
# give each one a fourth dimension in order to use it in conv2d
tR = tf.expand_dims(tR, 3)
tG = tf.expand_dims(tG, 3)
tB = tf.expand_dims(tB, 3)
#convolve each input image with the gaussian filter
tR_gauss = tf.nn.conv2d(tR, gauss_var, strides=[1, 1, 1, 1], padding='SAME')
tG_gauss = tf.nn.conv2d(tG, gauss_var, strides=[1, 1, 1, 1], padding='SAME')
tB_gauss = tf.nn.conv2d(tB, gauss_var, strides=[1, 1, 1, 1], padding='SAME')
I calculate the difference by doing:
# calculate difference
R_diff = tf.subtract(tR, tR_gauss)
G_diff = tf.subtract(tR, tG_gauss)
B_diff = tf.subtract(tR, tB_gauss)
And make it into one number:
total_diff = tf.add_n([R_diff, G_diff, B_diff])
gaussian_loss = tf.reduce_sum(total_diff)
The problem is that the resulting image always shows bars at the borders, and is colored blueish. This is an over-exaggerated example of a final image.
I'm pretty sure this bordering effect has something to do with conv2d, but I don't know how to change it. So far I've tried using different kernel sizes, and although the borders change, they still remain. Changing padding from 'SAME' to 'VALID' results in different output dimensions which is also problematic. Any ideas on how to solve this?
Thanks in advance!
I had a similar problem with ugly borders around my output image.
I found out that the padding='SAME' option of conv2d adds zeros to the outside of the image.
In my case the problem was that my images had a white background, which is color value 255, so the zero padding effectively added a black border, resulting in a large color gradient at the edges.
Maybe this thought can help others, even if it's almost a year later...
Wow, I'm sorry no one answered your question. I'm no expert, but I'll give it a try. First, there is no perfect solution.
The easiest way is to look at tf.pad. It has a couple of modes that allow you to grow your images by copying the edges. Or, you may need to resort to tf.tile. The padding amount should match half your kernel width. Then use 'VALID' padding for the convolution: the output will be smaller than the padded input but the same size as your original input. A minimal sketch of this idea follows below.
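Here is that sketch, reusing the question's gauss_var ([4, 4, 1, 1] Gaussian kernel) and one of its single-channel tensors such as tR (shape [batch, height, width, 1]); REFLECT mode mirrors edge content instead of inserting zeros:

import tensorflow as tf

k = 4                          # kernel size used in the question
pad_lo = (k - 1) // 2          # 1
pad_hi = (k - 1) - pad_lo      # 2

# Mirror-pad the spatial dims by a total of k-1 pixels, then convolve with
# 'VALID': the padded border is consumed by the kernel, so the output has
# exactly the same spatial size as the original (unpadded) input.
tR_padded = tf.pad(tR, [[0, 0], [pad_lo, pad_hi], [pad_lo, pad_hi], [0, 0]],
                   mode='REFLECT')
tR_gauss = tf.nn.conv2d(tR_padded, gauss_var, strides=[1, 1, 1, 1], padding='VALID')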
For a better, yet more difficult, solution you can work in the frequency domain: convert the images to the frequency domain, zero out some of the higher frequencies, and then return to the spatial domain.
Hopefully you have already solved your problem.
For future readers, this link points to a small project that uses tf.nn.depthwise_conv2d to avoid the problem with the edges.
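As a rough illustration of that approach (my own sketch, not the linked project's code), tf.nn.depthwise_conv2d filters every channel independently in a single call, which avoids the per-channel unstacking from the question; combined with mirror padding it also avoids the zero-padded borders. It assumes input_image of shape [batch, height, width, 3] and the question's gauss_var of shape [4, 4, 1, 1]:

import tensorflow as tf

k = 4
pad_lo, pad_hi = (k - 1) // 2, (k - 1) - (k - 1) // 2

# Mirror-pad the borders so no artificial zeros are introduced.
padded = tf.pad(input_image,
                [[0, 0], [pad_lo, pad_hi], [pad_lo, pad_hi], [0, 0]],
                mode='REFLECT')

# depthwise_conv2d expects a filter of shape [k, k, in_channels, channel_multiplier],
# so tile the single-channel Gaussian kernel across the 3 input channels.
depthwise_kernel = tf.tile(gauss_var, [1, 1, 3, 1])        # -> [4, 4, 3, 1]
blurred = tf.nn.depthwise_conv2d(padded, depthwise_kernel,
                                 strides=[1, 1, 1, 1], padding='VALID')

gaussian_loss = tf.reduce_sum(input_image - blurred)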
I'm implementing a CNN with Theano. According to the paper, I have to do this image preprocessing before training the CNN:
We extracted RGB patches of 61x61 dimensions associated with each poselet activation, subtracted the mean and used this data to train the convnet model shown in Table 1
Can you tell me what "subtracted the mean" means? Tell me if these steps are correct (this is what I understood):
1) Compute the mean for Red Channel, Green Channel and Blue Channel for the whole image
2) For each pixel, subtract from red value the mean of red channel, from green value the mean of green channel and the same for the blue channel
3) Is it correct to have negative values, or do I have to use the absolute value?
Thanks all!!
You should read the paper carefully, but most probably they mean the mean of the patches. You have N matrices of 61x61 pixels, each of which is equivalent to a vector of length 61^2 (or 3*61^2 if there are three channels). What they do is simply compute the mean of each dimension: over these N vectors they calculate the mean in each of the 3*61^2 dimensions. As a result they obtain a mean vector of length 3*61^2 (or a mean matrix / mean patch, if you prefer) and subtract it from each of the N patches. The resulting patches will have negative values; that is perfectly fine, you should not take the absolute value, neural networks prefer this kind of data.
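A minimal NumPy sketch of that mean-patch subtraction, assuming the patches are stacked in an array of shape (N, 61, 61, 3):

import numpy as np

# patches: (N, 61, 61, 3) RGB patches; random data here for illustration
patches = np.random.rand(1000, 61, 61, 3)

# Mean patch: the mean over the N patches, per pixel and per channel -> (61, 61, 3)
mean_patch = patches.mean(axis=0)

# Subtract the mean patch from every patch; negative values are expected and fine
centered = patches - mean_patch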
I would assume the mean mentioned in the paper is the mean over all images used in the training set (computed separately for each channel).
Several indications:
Caffe is a library for ConvNets. In their tutorial they mention the "compute image mean" step: http://caffe.berkeleyvision.org/gathered/examples/imagenet.html
For this they use the following script: https://github.com/BVLC/caffe/blob/master/examples/imagenet/make_imagenet_mean.sh
which does what I indicated.
Google played around with ConvNets and published their code here: https://github.com/google/deepdream/blob/master/dream.ipynb and they do also use the mean of the training set.
This is of course only indirect evidence since I can not explain you why this happens. In fact I stumbled over this question while trying to figure out precisely that.
//EDIT:
In the meantime I found a source confirming my claim (highlighting added by me):
There are three common forms of data preprocessing a data matrix X [...]
Mean subtraction is the most common form of preprocessing. It involves subtracting the mean across every individual feature in the data, and has the geometric interpretation of centering the cloud of data around the origin along every dimension. In numpy, this operation would be implemented as: X -= np.mean(X, axis = 0). With images specifically, for convenience it can be common to subtract a single value from all pixels (e.g. X -= np.mean(X)), or to do so separately across the three color channels.
As we can see, the whole dataset is used to compute the mean.
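To spell out the three variants the quote mentions, a small NumPy sketch, assuming X holds a batch of RGB images of shape (N, H, W, 3):

import numpy as np

X = np.random.rand(1000, 61, 61, 3)   # example batch of RGB images

# 1) Per-feature mean: one mean per pixel position and channel (the "mean image")
X_per_feature = X - X.mean(axis=0)

# 2) A single scalar mean subtracted from all pixels
X_single = X - X.mean()

# 3) Per-channel mean: one mean per color channel, as in the Caffe/torchvision examples
per_channel_mean = X.mean(axis=(0, 1, 2))      # shape (3,)
X_per_channel = X - per_channel_mean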