I want to find the probability distribution of two images so I can calculate KL Divergence.
I'm trying to figure out what "probability distribution" means in this sense. I've converted my images to grayscale, flattened them into 1D arrays, and plotted them as histograms with bins=256:
imageone = imgGray.flatten() # array([0.64991451, 0.65775765, 0.66560078, ...,
imagetwo = imgGray2.flatten()
plt.hist(imageone, bins=256, label = 'image one')
plt.hist(imagetwo, bins=256, alpha = 0.5, label = 'image two')
plt.legend(loc='upper left')
My next step is to call the ks_2samp function from scipy.stats to calculate the divergence, but I'm unclear what arguments to use.
A previous answer explained that we should "take the histogram of the image (in gray scale) and then divide the histogram values by the total number of pixels in the image. This will result in the probability to find a gray value in the image."
Ref: Can Kullback-Leibler be applied to compare two images?
But what do we mean by take the histogram values? How do I 'take' these values?
I might be overcomplicating things, but I'm confused by this.
The hist function returns three values; the first is the count (number of samples) in each histogram bin. If you pass the density=True argument to hist, these values will instead be the probability density in each bin, i.e.:
prob1, _, _ = plt.hist(imageone, bins=256, density=True, label = 'image one')
prob2, _, _ = plt.hist(imagetwo, bins=256, density=True, alpha = 0.5, label = 'image two')
You can then calculate the KL divergence using the scipy entropy function:
from scipy.stats import entropy
entropy(prob1, prob2)
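Note that each plt.hist call above picks its own 256 bin edges from that image's value range, so the two probability arrays may describe slightly different bins. For the KL divergence it can be safer to histogram both images over the same edges. Here is a minimal sketch using np.histogram, assuming imgGray and imgGray2 are the grayscale arrays from the question with values in [0, 1]:
import numpy as np
from scipy.stats import entropy

# Shared bin edges for both images (assumes grayscale values in [0, 1]).
edges = np.linspace(0.0, 1.0, 257)  # 256 bins
p, _ = np.histogram(imgGray.flatten(), bins=edges, density=True)
q, _ = np.histogram(imgGray2.flatten(), bins=edges, density=True)

# entropy(p, q) normalizes its inputs, so densities work as well as raw counts.
# A small constant guards against empty bins in q, which would make the KL infinite.
eps = 1e-12
print(entropy(p + eps, q + eps))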
I am struggling to understand why np.mean fails when trying to estimate the mean value of a noisy signal. I thought about floating point errors but don't think that's the case.
Some background: I'm trying to estimate the noise standard deviation and the mean of the noise in a signal (image of a sharp edge).
First, let's generate a noisy image of an edge:
import numpy as np
import matplotlib.pyplot as plt
from scipy import special
Let's define a function that describes the edge
esf_p = lambda x, sigma, mu, I1, I0: I0 + (0.5)*(I1 - I0)*(1 + special.erf((x-mu)/(np.sqrt(2)*sigma)))
And now let's turn it into an image
x = np.mgrid[0:100,0:100][1]
im = esf_p(x, 0.62, 50, 100, 0)
The output image looks like this
Now let's add some noise to it
noisefunc = np.random.normal(loc = 0, scale = esf_p(x, 0.62, 50, 10, 10))
im_noise = im + noisefunc
The problem I'm having is computing the mean of the noise. First we need to estimate the noise:
noise_est = im_noise - np.mean(im_noise, axis=0)
noise_mean_estimate = np.mean(noise_est, axis=0)
We can see that the estimated noise is really close to the noise function we added to the image if we plot a random row of pixels from the image (let's take the last row of pixels).
plt.plot(noise_est[-1], label = 'estimated')
plt.plot(noisefunc[-1], label= 'original noise')
plt.legend()
Comparing the estimated noise with the ground truth noise function:
We can see that the estimated noise is pretty close to the ground truth noise function.
Now, here is where it gets weird. I want to compute the mean of the noise estimate. However, the values I get are absurd.
noise_mean = np.mean(noise_est, axis=0)
plt.plot(np.sqrt(100)*noise_mean, label = 'mean of noise from noise estimate')
plt.legend()
Mean of the estimated noise function:
Obviously the mean should be a lot higher (we used a standard deviation of 10).
The factor of the square root of 100 just reflects the fact that we have averaged over 100 rows of the image and we want the mean of the noise, not the noise of the mean.
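As a quick sanity check on that square-root factor (this is only the scaling argument, using a constant noise standard deviation of 10 rather than the spatially varying noise above):
import numpy as np

rng = np.random.default_rng(0)
check = rng.normal(0, 10, size=(100, 100))   # 100 rows of noise with std = 10

col_means = check.mean(axis=0)               # one mean per column
print(col_means.std())                       # ~1, i.e. 10 / sqrt(100)
print(np.sqrt(100) * col_means.std())        # ~10, back to the noise std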
What's even more bizarre is that if we get rid of the last row of pixels in our estimate of the noise mean the results change dramatically (by 13 orders of magnitude).
noise_mean = np.mean(noise_est[:-1, :], axis=0)
plt.plot(np.sqrt(99)*noise_mean, label = 'mean of noise from noise estimate')
plt.legend()
Mean of the estimated noise function without the last row:
Now the estimated mean looks a lot better; however, it is still one order of magnitude lower than the ground truth.
Oddly enough, this issue does not pop up when estimating the mean of the ground truth noise function:
plt.plot(np.sqrt(100)*np.mean(noisefunc, axis=0), label = 'mean of ground truth noise')
plt.plot(noisefunc[-1], label = 'noise in last row of pixels')
plt.legend()
Comparing the mean of the ground truth noise with the last row of pixels in ground truth noise function:
So we can see that the square root of n term just scales up the estimated noise value.
Let's try to compute the standard deviation as a function of increasing rows:
test_std = np.zeros(100)
for i in range(100):
    test_std[i] = np.sqrt(i)*np.std(np.mean(noise_est[:i+1, :], axis=0), axis=0)
plt.plot(test_std, label = 'std of estimated noise mean')
plt.legend()
Standard deviation of the estimated noise mean as a function of increasing rows:
We can see that it tries to go up to 10 for a low number of rows and then drops massively (with the full 100 rows it goes down to something like 1e-13). So it gets less and less accurate as more rows are averaged over (one would expect the opposite).
If we investigate the behavior of the mean of the ground truth noise function as a function of increasing rows, we see that the mean remains constant:
test_std = np.zeros(100)
for i in range(100):
    test_std[i] = np.sqrt(i)*np.std(np.mean(noisefunc[:i+1, :], axis=0), axis=0)
plt.plot(test_std, label = 'std of ground truth noise function')
plt.legend()
Standard deviation of the mean of the ground truth noise function with respect to increasing rows:
So what could be happening with np.mean()? Why is it outputting such a drastic change in the mean of the noise estimate just by taking away the last row? And why would it behave well for the ground truth but completely mess up the estimated noise? They seem incredibly close as we saw.
I am not sure what I am doing wrong, so any comments/suggestions are greatly appreciated!
I'm trying to fit the generalized extreme value (GEV) distribution's probability density function (pdf) to my data's pdf. The histogram depends on the bins: as I adjust the bin width, the result of the fit also changes. curve_fit(func, x, y) handles this properly, but it uses least squares estimation, and what I want is maximum likelihood estimation (MLE). The stats.genextreme.fit(data) function gives good results, but it works on the original data, so it does not reflect how the histogram shape changes with the bins.
I have tried to use MLE myself. I succeeded in estimating the parameters of the standard normal distribution with MLE, but again that is based on the original data and does not change according to the bins. And I could not estimate the parameters of the GEV that way, even with the original data.
I checked the source code of genextreme_gen, rv_continuous, etc., but it is too complicated for me to follow with my current Python skills.
I would like to estimate the parameters of the GEV distribution through MLE, in such a way that the estimate changes according to the bins. What should I do?
I am sorry for my poor English, and thank you for your help.
Here is my current code:
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

# fg is my GEV pdf function and h105[1] is the data (both defined elsewhere)
h = 0.5                                             # bin width
dat = h105[1]                                       # data
b = np.arange(min(dat)-h/2, max(dat), h)            # bin edges
n, bins = np.histogram(dat, bins=b, density=True)   # histogram
x = 0.5*(bins[1:]+bins[:-1])                        # bin centers (x-values of the histogram)
popt, _ = curve_fit(fg, x, n)                       # curve_fit(GEV pdf, bin centers, pdf values)
popt = -popt[0], popt[1], popt[2]                   # estimated parameters (least squares estimation, LSE)
x1 = np.linspace((popt[1]-popt[2])/popt[0], dat.max(), 1000)
a1 = stats.genextreme.pdf(x1, *popt)                # pdf with the LSE parameters
popt = stats.genextreme.fit(dat)                    # estimated parameters (maximum likelihood estimation, MLE)
x2 = np.linspace((popt[1]-popt[2])/popt[0], dat.max(), 1000)
a2 = stats.genextreme.pdf(x2, *popt)                # pdf with the MLE parameters
Results for bin width = 2 and bin width = 0.5:
One way to do this is to convert the bins back into data. You can do so by counting the number of data points in each bin and then repeating the center of each bin that many times.
I have also tried sampling uniform values from within each bin, but repeating the bin center seems to give parameters with a higher likelihood.
import scipy.stats as stats
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt
ground_truth_params = (0.001, 0.5, 0.999)
count = 50
h = 0.2 # bin width
dat = stats.genextreme.rvs(*ground_truth_params, count) # data
b = np.arange(np.min(dat)-h/2, np.max(dat), h) # bin range
n, bins = np.histogram(dat, bins=b, density=True) # histogram
bin_counts, _ = np.histogram(dat, bins=b, density=False) # histogram
x = 0.5*(bins[1:]+bins[:-1]) # x-value of histogram
def flatten(l):
    return [item for sublist in l for item in sublist]
popt,_ = curve_fit(stats.genextreme.pdf, x, n, p0=[0,1,1]) # curve_fit(GEV's pdf, x-value of histogram, pdf value)
popt_lse = -popt[0], popt[1], popt[2] # estimated parameter (least squares estimation, LSE)
popt_mle = stats.genextreme.fit(dat) # estimated parameter (Maximum likelihood estimation, MLE)
uniform_dat_from_bins = flatten((np.linspace(x - h/2, x + h/2, n) for n, x in zip(bin_counts, x)))
popt_uniform_mle = stats.genextreme.fit(uniform_dat_from_bins) # estimated parameter (Maximum likelihood estimation, MLE)
centered_dat_from_bins = flatten(([x] * n for n, x in zip(bin_counts, x)))
popt_centered_mle = stats.genextreme.fit(centered_dat_from_bins) # estimated parameter (Maximum likelihood estimation, MLE)
plot_params = {
    ground_truth_params: 'tab:green',
    popt_lse: 'tab:red',
    popt_mle: 'tab:orange',
    popt_centered_mle: 'tab:blue',
    popt_uniform_mle: 'tab:purple'
}
param_names = ['GT', 'LSE', 'MLE', 'bin centered MLE', 'bin uniform MLE']
plt.figure(figsize=(10,5))
plt.bar(x, n, width=h, color='lightgrey')
plt.ylim(0, 0.5)
plt.xlim(-2,10)
for params, color in plot_params.items():
    x_pdf = np.linspace(-2, 10, 1000)
    y_pdf = stats.genextreme.pdf(x_pdf, *params)  # the GEV pdf for this parameter set
    plt.plot(x_pdf, y_pdf, label='pdf', color=color)
plt.legend(param_names)
plt.figure(figsize=(10,5))
for params, color in plot_params.items():
    plt.plot(np.sum(stats.genextreme.logpdf(dat, *params)), 'o', color=color)
This plot shows the PDFs estimated using the different methods along with the ground-truth PDF.
And the next plot shows the likelihoods of the estimated parameters given the original data.
The PDF estimated by MLE on the original data has the maximum likelihood, as expected. Then come the PDFs estimated from the histogram bins (centered and uniform), followed by the ground-truth PDF. Finally, the PDF with the lowest likelihood is the one estimated using least squares.
Is it possible to do a histogram equalization without the extreme values 0 and 255?
Specifically, I have an image in which many pixels are zero; more than half of all pixels are zero. So if I do a histogram equalization, I basically shift the value 1 up to around 240, which is exactly the opposite of what I want a histogram equalization to do.
So is there a method to calculate the histogram equalization only between the values 1 and 254?
At the moment my code looks the following:
flat = image.flatten()
# get image histogram
image_histogram, bins = np.histogram(flat, bins=range(0, number_bins), density=True)
cdf = image_histogram.cumsum() # cumulative distribution function
cdf = 255 * cdf /cdf.max() # normalize
cdf = cdf.astype('uint8')
# use linear interpolation of cdf to find new pixel values
image_equalized = np.interp(flat, bins[:-1], cdf)
image_equalized = image_equalized.reshape(image.shape), cdf
Thanks
One way to solve this would be to filter out the unwanted values before we make the histogram, and then make a "conversion table" from a non-normalized pixel to a normalized pixel.
import numpy as np
# generate random image
image = np.random.randint(0, 256, (32, 32))
# flatten image
flat = image.flatten()
# get image histogram
image_histogram, bins = np.histogram(flat[np.where((flat != 0) & (flat != 255))[0]],
                                     bins=range(0, 256),  # cover the full 0-255 range
                                     density=True)
cdf = image_histogram.cumsum() # cumulative distribution function
cdf = 255 * cdf /cdf.max() # normalize
cdf = cdf.astype('uint8')
# use linear interpolation of cdf to find new pixel values
# we make a list conversion_table, where the index is the original pixel value,
# and the value is the histogram normalized pixel value
conversion_table = np.interp([i for i in range(0, 256)], bins[:-1], cdf)
# replace unwanted values by original
conversion_table[0] = 0
conversion_table[-1] = 255
image_equalized = np.array([conversion_table[pixel] for pixel in flat])
image_equalized = image_equalized.reshape(image.shape), cdf
disclaimer: I have absolutely no experience whatsoever with image processing, so I have no idea about the validity :)
Taking a tip from another thread (@EnricoGiampieri's answer to cumulative distribution plots python), I wrote:
# plot cumulative density function of nearest nbr distances
# evaluate the histogram
values, base = np.histogram(nearest, bins=20, density=1)
#evaluate the cumulative
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, label='data')
I put in the density=1 from the documentation on np.histogram, which says:
"Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. "
Well, indeed, when plotted, they don't sum to 1. But I do not understand the "bins of unity width". When I set bins to 1, of course, I get an empty chart; when I set it to the population size, I don't get a sum of 1 (more like 0.2). When I use the 40 bins suggested, they sum to about 0.006.
Can anybody give me some guidance? Thanks!
You can simply normalize your values variable yourself like so:
unity_values = values / values.sum()
A full example would look something like this:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(size=37)
density, bins = np.histogram(x, density=True)
unity_density = density / density.sum()
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharex=True, figsize=(8,4))
widths = bins[1:] - bins[:-1]
ax1.bar(bins[1:], density, width=widths)
ax2.bar(bins[1:], density.cumsum(), width=widths)
ax3.bar(bins[1:], unity_density, width=widths)
ax4.bar(bins[1:], unity_density.cumsum(), width=widths)
ax1.set_ylabel('Not normalized')
ax3.set_ylabel('Normalized')
ax3.set_xlabel('PDFs')
ax4.set_xlabel('CDFs')
fig.tight_layout()
You need to make sure your bins are all width 1. That is:
np.all(np.diff(base)==1)
To achieve this, you have to manually specify your bins:
bins = np.arange(np.floor(nearest.min()), np.ceil(nearest.max()) + 1)
values, base = np.histogram(nearest, bins=bins, density=1)
And you get:
In [18]: np.all(np.diff(base)==1)
Out[18]: True
In [19]: np.sum(values)
Out[19]: 0.99999999999999989
Actually the statement
"Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function."
means that the output we are getting is the probability density for each bin. In a pdf, the probability of lying between two values 'a' and 'b' is the area under the pdf curve between 'a' and 'b'.
Therefore, to get the probability for a given bin, we have to multiply that bin's pdf value by its bin width; the resulting sequence of probabilities can then be used directly to compute the cumulative probabilities (as they are now normalized).
Note that these probabilities sum to 1, which satisfies the requirement that the total probability is 1; in other words, the probabilities are normalized.
See the code below, where I use bins of different widths: some of width 1 and some of width 2.
import numpy as np
import math
rng = np.random.RandomState(10) # deterministic random data
a = np.hstack((rng.normal(size=1000),
rng.normal(loc=5, scale=2, size=1000))) # 'a' is our distribution of data
mini = math.floor(min(a))
maxi = math.ceil(max(a))
print(mini)
print(maxi)
ar1 = np.arange(mini, maxi/2)                  # bin edges of width 1
ar2 = np.arange(math.ceil(maxi/2), maxi+2, 2)  # bin edges of width 2
ar = np.hstack((ar1, ar2))
print(ar)  # ar holds the unequal-width bin edges used below
counts, bin_edges = np.histogram(a, bins=ar,
                                 density=True)
print(counts)     # the pdf value of each bin
print(bin_edges)  # the corresponding bin edges
print(np.sum(counts*np.diff(bin_edges)))     # total sum of probabilities, equal to 1
print(np.cumsum(counts*np.diff(bin_edges)))  # cumulative sum; the last value is 1
Now, the reason I think they mention that the width of the bins should be 1 might be the following: if the width of a bin is equal to 1, then the pdf value and the probability of that bin are equal, because the area under the bin is just 1 multiplied by the pdf value of that bin, which is that same pdf value again.
So in this case, the pdf value equals the probability of the respective bin, and the values are already normalized.
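And just to see the unit-width case directly, here is a small check along the same lines (same kind of data as above, but with bin edges spaced exactly 1 apart), where the density values returned by np.histogram already sum to 1:
import numpy as np
import math

rng = np.random.RandomState(10)
a = np.hstack((rng.normal(size=1000),
               rng.normal(loc=5, scale=2, size=1000)))

edges = np.arange(math.floor(min(a)), math.ceil(max(a)) + 1)  # unit-width bins
density, bin_edges = np.histogram(a, bins=edges, density=True)

print(np.all(np.diff(bin_edges) == 1))  # True: every bin has width 1
print(np.sum(density))                  # ~1.0: pdf value per bin == probability per bin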