When I bin my data according to scipy.stats.binned_statistic (see here for an example), how do I get the error (that is, the standard deviation) on the binned average values?
For example, if I bin my data as follows:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

windspeed = 8 * np.random.rand(500)
boatspeed = .3 * windspeed**.5 + .2 * np.random.rand(500)
bin_means, bin_edges, binnumber = stats.binned_statistic(windspeed,
    boatspeed, statistic='median', bins=[1, 2, 3, 4, 5, 6, 7])
plt.figure()
plt.plot(windspeed, boatspeed, 'b.', label='raw data')
plt.hlines(bin_means, bin_edges[:-1], bin_edges[1:], colors='g', lw=5,
    label='binned statistic of data')
plt.legend()
how do I get the standard deviation on the bin_means?
The way to go about this is to construct a probability density estimate from the histogram (this is just a question of normalizing the histogram appropriately), and then compute the standard deviation, or any other statistic, for the estimated density.
The appropriate normalization is whatever is needed to get the area under the histogram to be 1. As for computing statistics for the density estimate, work from the definition of the statistic as integral(p(x)*f(x), x, -infinity, +infinity), substituting the density estimate for p(x) and whatever is needed for f(x), e.g. x and x^2 to get the first and second moments, from which you calculate the variance and then the standard deviation.
I'll post some formulas tomorrow, or maybe someone else wants to give it a try in the meantime. You might be able to look up some formulas, but my advice is to always try to work out the answer before resorting to looking it up.
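In the meantime, here is a rough sketch of what that computation can look like, assuming equally spaced bins from np.histogram (the variable names are mine):
import numpy as np

# Stand-in sample data
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

# Histogram normalized to a density estimate (area under the bars equals 1)
counts, edges = np.histogram(data, bins=30, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
widths = np.diff(edges)

# Approximate integral(p(x)*f(x) dx) by summing density * f(center) * bin_width
mean = np.sum(counts * centers * widths)               # first moment
second_moment = np.sum(counts * centers**2 * widths)   # second moment
std = np.sqrt(second_moment - mean**2)

print(mean, std)  # should come out close to 5.0 and 2.0 for this sample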
Maybe I'm a bit late to answer, but I was wondering how to do the same thing and came across this question. I think calculating it with stats.binned_statistic_2d should be possible, but I haven't figured it out yet. For now I calculated it manually, like so (note that in my code I use a fixed number of equally spaced bins):
import numpy
from scipy import stats
from matplotlib import pyplot

windspeed = 8 * numpy.random.rand(500)
boatspeed = .3 * windspeed**.5 + .2 * numpy.random.rand(500)
bin_means, bin_edges, binnumber = stats.binned_statistic(windspeed,
    boatspeed, statistic='median', bins=10)

stds = []
# Match each value to the bin number it belongs to
pairs = list(zip(boatspeed, binnumber))
# Calculate stdev for all elements inside each bin
for n in sorted(set(binnumber)):  # Iterate over each bin, in bin order
    in_bin = [x for x, nbin in pairs if nbin == n]  # Get all elements inside bin n
    stds.append(numpy.std(in_bin))

# Calculate the locations of the bins' centers, for plotting
bin_centers = []
for i in range(len(bin_edges) - 1):
    center = bin_edges[i] + (float(bin_edges[i + 1]) - float(bin_edges[i]))/2.
    bin_centers.append(center)

# Plot means
pyplot.figure()
pyplot.hlines(bin_means, bin_edges[:-1], bin_edges[1:], colors='g', lw=5,
    label='binned statistic of data')
# Plot stdev as vertical lines, probably can also be done with errorbar
pyplot.vlines(bin_centers, bin_means - stds, bin_means + stds)
pyplot.legend()
pyplot.show()
Resulting plot (minus the data points):
You have to be careful with the bins. In the code where I actually use this, one of the bins has no points, and I have to adjust my calculation of the stdev accordingly (see the sketch below).
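One way to keep everything aligned in that case is to loop over all bin indices and record NaN for empty bins; a rough sketch, reusing boatspeed and binnumber from the code above:
import numpy

n_bins = 10
stds = []
for n in range(1, n_bins + 1):          # binned_statistic numbers bins from 1
    in_bin = boatspeed[binnumber == n]  # values that fell into bin n
    stds.append(numpy.std(in_bin) if in_bin.size else numpy.nan)  # NaN marks an empty bin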
Just change this line so that binned_statistic computes the standard deviation directly:
bin_std, bin_edges, binnumber = stats.binned_statistic(windspeed,
boatspeed, statistic='std', bins=[1,2,3,4,5,6,7])
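For completeness, here is a rough sketch of how the 'mean' and 'std' calls might be combined into an error-bar plot (the variable names below are just for illustration):
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

windspeed = 8 * np.random.rand(500)
boatspeed = .3 * windspeed**.5 + .2 * np.random.rand(500)
bins = [1, 2, 3, 4, 5, 6, 7]

bin_means, bin_edges, _ = stats.binned_statistic(
    windspeed, boatspeed, statistic='mean', bins=bins)
bin_stds, _, _ = stats.binned_statistic(
    windspeed, boatspeed, statistic='std', bins=bins)

bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
plt.plot(windspeed, boatspeed, 'b.', alpha=0.3, label='raw data')
plt.errorbar(bin_centers, bin_means, yerr=bin_stds, fmt='o', capsize=3,
             label='bin mean +/- std')
plt.legend()
plt.show()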
I've always thought it would be useful to calculate the probability between two values on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found from an article on this topic:
from sklearn.neighbors import KernelDensity
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=1000000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)
0.6338
This returns a probability that converges to about 0.6338. This confused me, as the 68-95-99.7 rule states that the probability of a value being within one standard deviation of the mean in either direction should be 68%.
I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:
x = np.random.randint(100, size=(1000000))
# sns.kdeplot(x) # this is how i'd generate a kdeplot of this data
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(np.median(x), x.max(), 100, kd)
0.4946
And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?
EDIT: Could you potentially calculate something by using the data generated from a kdeplot?
fig, ax = plt.subplots()
sns.kdeplot(x)
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
And implement np.interp() somehow?
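For instance, something along these lines might work (a rough sketch using the xs and ys pulled from the kdeplot line above; np.interp resamples the curve and np.trapz approximates the integral):
import numpy as np

def probability_between(lower, upper, xs, ys, num=1000):
    '''Approximate P(lower <= X <= upper) from a kdeplot's line data.'''
    grid = np.linspace(lower, upper, num)
    dens = np.interp(grid, xs, ys)   # resample the KDE curve on the interval
    return np.trapz(dens, grid)      # trapezoidal integral of the density

# e.g. the probability within one standard deviation of the mean
p = probability_between(x.mean() - x.std(), x.mean() + x.std(), xs, ys)
Note that this still inherits whatever bandwidth seaborn chose for the KDE.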
More edits:
Using CDFs per @7shoe, I was able to get a way better (and correct) result for my normal distribution example:
from scipy.stats import norm
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=10000000)
norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())
However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's EPA per pass from last season:
import pandas as pd
import seaborn as sns
import random
import numpy as np
YEAR = 2021
data = pd.read_csv(
    'https://github.com/nflverse/nflfastR-data/blob/master/data/play_by_play_' \
    + str(YEAR) + '.csv.gz?raw=True', compression='gzip', low_memory=False
)
df = data.loc[data.passer == 'T.Brady','epa'].copy()
# tom brady's distribution
sns.kdeplot(df)
sample_mean = []
for i in range(50):
    y = np.random.choice(df, 500)
    avg = np.mean(y)
    sample_mean.append(avg)
# distribution of sampling means - can we assume this is normal and proceed with cdfs?
sns.kdeplot(sample_mean)
Could we use sampling means, or even just bootstrap resampling methods (a rough sketch follows below), to:
1. Make a more "normal" distribution out of sampling means in order to use CDFs when the initial distribution doesn't quite appear normal (this, though, would be a distribution of means rather than of individual samples; is that discouraged?), or
2. If the distribution already resembles a normal distribution, simply use such resampling methods to create better parametric estimates?
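For concreteness, a rough sketch of the bootstrap-of-means idea, vectorizing the loop above (note that this describes the sampling distribution of the mean, not of individual plays):
import numpy as np

rng = np.random.default_rng(0)
values = df.dropna().to_numpy()   # the EPA values from the snippet above

# Bootstrap distribution of the sample mean
boot_means = rng.choice(values, size=(10_000, values.size), replace=True).mean(axis=1)

# If we treat the bootstrap means as roughly normal, CDF-style statements
# then apply to the mean, not to individual passes
mu, sigma = boot_means.mean(), boot_means.std(ddof=1)
print(mu, sigma)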
Computing the probability p for some interval is not overly complicated. However, it can be tricky to combine the right tools to do so, in particular because there are several statistical approaches to choose from.
Probability theory
Given two numbers, let's call them lower and upper, what probability is enclosed in between them? If the cumulative distribution function (CDF) F is known, it is merely p = F(upper) - F(lower). Similarly, p coincides with the area enclosed by the probability density function (PDF) f's graph over the interval [lower, upper].
However, when the CDF/PDF is unknown, it becomes a statistical question. In a nutshell, estimating the PDF f and computing the area its graph encloses over the interval will do. But there are several paradigms and estimation procedures to obtain it.
1. Parametric estimation
One could assume that the data x is a set of IID realizations of some normal distribution, either because of prior knowledge or convenience. Then, one just needs to estimate its parameters mu (aka loc, the mean) and sigma (aka scale, the standard deviation). scipy.stats provides all we need in this setting. Moreover, it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit parameter
loc_hat, scale_hat = stats.norm.fit(x)
# probability
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)
# plot
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x = np.arange(lower, upper, 0.01),
y1 = stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat) ,
facecolor='red',
alpha=0.35)
plt.text(x=0.1, y=0.1, s= 'p=' + str(round(p, 3)))
plt.show()
which yields
2. Non-parametric estimation
In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it through estimated parameters as seen above). Kernel density estimation is the most popular variant to do so. In this case, as alluded to in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area enclosed by the density's graph over the interval [lower, upper] directly.
In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.integrate.quad(). The advantage is that it is lightning-fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows:
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))
def f_pred(x):
    '''wrapper function to compute probability'''
    return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]
p = quad(func=f_pred, a=lower, b=upper)
# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
x_axis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(x_axis.reshape(-1, 1))))
plt.fill_between(x = np.arange(lower, upper, 0.01),
y1 = np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))),
facecolor='red',
alpha=0.35)
plt.text(x=0.15, y=0.1, s= 'p=' + str(round(p[0], 3)))
plt.show()
and yields
I do see a bug in the get_probability function, but that bug causes it to compute a too high result - in np.sum(kd_vals * step), it's multiplying N sample values by a step with N-1 in the denominator, effectively resulting in an output a factor of N/(N-1) too high. (If they wanted to use a trapezoid rule computation for the integral, they should have divided the left and right endpoint values by 2 first.)
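For what it's worth, an easy way to sidestep that pitfall is to let NumPy do a proper trapezoid-rule integration; a minimal variant of the question's function (the name is mine):
import numpy as np

def get_probability_trapz(start_value, end_value, eval_points, kd):
    '''Same idea as get_probability above, but integrates with the trapezoid rule.'''
    x = np.linspace(start_value, end_value, eval_points)[:, np.newaxis]
    kd_vals = np.exp(kd.score_samples(x))   # PDF values at the evaluation points
    return np.trapz(kd_vals, x.ravel())     # trapezoidal integral of the PDF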
Other than that, the computation looks correct. The problem is that the model doesn't reflect the input distribution.
You're not modeling the distribution as a normal distribution. You're modeling it with a kernel density estimator with a Gaussian kernel, and the kernel bandwidth is very high relative to the scale of the distribution and the number of available samples. This results in the model being "flatter" than the actual distribution, with less of the probability concentrated in the center.
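One quick way to see the bandwidth effect is to repeat the integral for shrinking bandwidths, reusing get_probability_trapz from above (a rough sketch; the printed values are approximate):
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

for bw in (0.5, 0.1, 0.02):
    kd = KernelDensity(kernel='gaussian', bandwidth=bw).fit(x.reshape(-1, 1))
    p = get_probability_trapz(x.mean() - x.std(), x.mean() + x.std(), 200, kd)
    print(bw, round(p, 4))   # roughly 0.63 at bw=0.5, approaching 0.68 as bw shrinks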
I'm trying to fit the generalized extreme value (GEV) distribution's probability density function (PDF) to my data's PDF. The histogram depends on the bin width: as I adjust the bins, the result of the fit changes too, and curve_fit(func, x, y) handles that properly. But curve_fit uses least squares estimation, and what I want is maximum likelihood estimation (MLE). The stats.genextreme.fit(data) function gives good results with MLE, but it does not reflect how the histogram shape changes with the bins; it just uses the original data.
I tried using MLE myself. I succeeded in estimating the parameters of the standard normal distribution with MLE, but that is also based on the original data and does not change according to the bins, and I could not estimate the parameters of the GEV from the original data that way.
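For reference, explicit MLE for a normal distribution can be written roughly like this (a minimal sketch that minimizes the negative log-likelihood; not necessarily the exact code I used):
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.5, size=500)

def neg_log_likelihood(params, data):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

res = optimize.minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(sample,),
                        method='Nelder-Mead')
print(res.x)   # should be close to stats.norm.fit(sample)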
I checked the source code of genextreme_gen, rv_continuous, etc., but it is too complicated for me to follow with my current Python skills.
I would like to estimate the parameters of the GEV distribution through MLE, and I want the estimate to change according to the bins.
What should I do?
I am sorry for my poor English, and thank you for your help.
Addendum: here is my code so far:
h = 0.5 # bin width
dat = h105[1] # data
b = np.arange(min(dat)-h/2, max(dat), h) # bin range
n, bins = np.histogram(dat, bins=b, density=True) # histogram
x = 0.5*(bins[1:]+bins[:-1]) # x-value of histogram
popt,_ = curve_fit(fg, x, n) # curve_fit(GEV's pdf, x-value of histogram, pdf value)
popt = -popt[0], popt[1], popt[2] # estimated parameters (Least squares estimation, LSE)
x1 = np.linspace((popt[1]-popt[2])/popt[0], dat.max(), 1000)
a1 = stats.genextreme.pdf(x1, *popt) # pdf
popt = stats.genextreme.fit(dat) # estimated parameter (Maximum likelihood estimation, MLE)
x2 = np.linspace((popt[1]-popt[2])/popt[0], dat.max(), 1000)
a2 = stats.genextreme.pdf(x2, *popt)
Resulting plots for bin width = 2 and bin width = 0.5:
One way to do this is to convert the bins back into data. You can do so by counting the number of data points in each bin and then repeating the center of the bin that many times.
I have also tried sampling uniform values from each bin, but repeating the center of the bin seems to yield parameters with higher likelihood.
import scipy.stats as stats
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt
ground_truth_params = (0.001, 0.5, 0.999)
count = 50
h = 0.2 # bin width
dat = stats.genextreme.rvs(*ground_truth_params, count) # data
b = np.arange(np.min(dat)-h/2, np.max(dat), h) # bin range
n, bins = np.histogram(dat, bins=b, density=True) # histogram
bin_counts, _ = np.histogram(dat, bins=b, density=False) # histogram
x = 0.5*(bins[1:]+bins[:-1]) # x-value of histogram
def flatten(l):
    return [item for sublist in l for item in sublist]

popt, _ = curve_fit(stats.genextreme.pdf, x, n, p0=[0, 1, 1])  # curve_fit(GEV's pdf, x-value of histogram, pdf value)
popt_lse = -popt[0], popt[1], popt[2]  # estimated parameters (Least squares estimation, LSE)
popt_mle = stats.genextreme.fit(dat) # estimated parameter (Maximum likelihood estimation, MLE)
uniform_dat_from_bins = flatten((np.linspace(x - h/2, x + h/2, n) for n, x in zip(bin_counts, x)))
popt_uniform_mle = stats.genextreme.fit(uniform_dat_from_bins) # estimated parameter (Maximum likelihood estimation, MLE)
centered_dat_from_bins = flatten(([x] * n for n, x in zip(bin_counts, x)))
popt_centered_mle = stats.genextreme.fit(centered_dat_from_bins) # estimated parameter (Maximum likelihood estimation, MLE)
plot_params = {
ground_truth_params: 'tab:green',
popt_lse: 'tab:red',
popt_mle: 'tab:orange',
popt_centered_mle: 'tab:blue',
popt_uniform_mle: 'tab:purple'
}
param_names = ['GT', 'LSE', 'MLE', 'bin centered MLE', 'bin uniform MLE']
plt.figure(figsize=(10,5))
plt.bar(x, n, width=h, color='lightgrey')
plt.ylim(0, 0.5)
plt.xlim(-2,10)
for params, color in plot_params.items():
    x_pdf = np.linspace(-2, 10, 1000)
    y_pdf = stats.genextreme.pdf(x_pdf, *params)  # GEV pdf for these parameters
    plt.plot(x_pdf, y_pdf, label='pdf', color=color)
plt.legend(param_names)
plt.figure(figsize=(10,5))
for params, color in plot_params.items():
    plt.plot(np.sum(stats.genextreme.logpdf(dat, *params)), 'o', color=color)
This plot shows the PDFs estimated with the different methods, along with the ground-truth PDF:
And the next plot shows the likelihoods of the estimated parameters given the original data.
The PDF estimated by MLE on the original data has the maximum likelihood, as expected. Then follow the PDFs estimated from the histogram bins (centered and uniform), then the ground-truth PDF, and finally the PDF with the lowest likelihood, which is the one estimated by least squares.
Given an undirected NetworkX Graph graph, I want to check if it is scale free.
To do this, as I understand, I need to find the degree k of each node, and the frequency of that degree P(k) within the entire network. This should represent a power law curve due to the relationship between the frequency of degrees and the degrees themselves.
Plotting my calculations for P(k) against k displays a power curve as expected, but when I plot it on a double log scale, I do not get a straight line.
The following plots were obtained with 1000 nodes.
Code as follows:
import math
import numpy as np
import matplotlib.pyplot as plt

k = []
Pk = []
for node in list(graph.nodes()):
    degree = graph.degree(nbunch=node)
    try:
        pos = k.index(degree)
    except ValueError as e:
        k.append(degree)
        Pk.append(1)
    else:
        Pk[pos] += 1

# get a double log representation
logk = []
logPk = []
for i in range(len(k)):
    logk.append(math.log10(k[i]))
    logPk.append(math.log10(Pk[i]))

order = np.argsort(logk)
logk_array = np.array(logk)[order]
logPk_array = np.array(logPk)[order]
plt.plot(logk_array, logPk_array, ".")

m, c = np.polyfit(logk_array, logPk_array, 1)
plt.plot(logk_array, m*logk_array + c, "-")
The slope m is supposed to give the scaling exponent, and if its magnitude is between 2 and 3 then the network ought to be scale free.
The graphs are obtained by calling the NetworkX's scale_free_graph method, and then using that as input for the Graph constructor.
Update
As per request from @Joel, below are the plots for 10000 nodes.
Additionally, the exact code that generates the graph is as follows:
graph = networkx.Graph(networkx.scale_free_graph(num_of_nodes))
As we can see, a significant portion of the values does seem to form a straight line, but the distribution has a strange tail in its double-log form.
Have you tried the powerlaw module in Python?
It's pretty straightforward.
First, create a degree distribution variable from your network:
degree_sequence = sorted([d for n, d in G.degree()], reverse=True) # used for degree distribution and powerlaw test
Then fit the data to powerlaw and other distributions:
import powerlaw  # Power laws are probability distributions with the form: p(x) ∝ x^(-α)
fit = powerlaw.Fit(degree_sequence)
Take into account that powerlaw automatically finds the optimal value of xmin (and the corresponding alpha) by creating a power-law fit starting from each unique value in the dataset, then selecting the one that results in the minimal Kolmogorov-Smirnov distance, D, between the data and the fit. If you want to include all your data, you can set the xmin value as follows:
fit = powerlaw.Fit(degree_sequence, xmin=1)
Then you can plot:
fig2 = fit.plot_pdf(color='b', linewidth=2)
fit.power_law.plot_pdf(color='g', linestyle='--', ax=fig2)
which will produce an output like this:
powerlaw fit
On the other hand, it may not be a power-law distribution but some other distribution (lognormal, exponential, etc.); you can check this with powerlaw.distribution_compare:
R, p = fit.distribution_compare('power_law', 'exponential', normalized_ratio=True)
print (R, p)
where R is the likelihood ratio between the two candidate distributions. This number will be positive if the data is more likely under the first distribution, but you should also check that p < 0.05.
Finally, once you have chosen an xmin for your distribution, you can plot a comparison between some usual degree distributions for social networks:
plt.figure(figsize=(10, 6))
fit.distribution_compare('power_law', 'lognormal')
fig4 = fit.plot_ccdf(linewidth=3, color='black')
fit.power_law.plot_ccdf(ax=fig4, color='r', linestyle='--') #powerlaw
fit.lognormal.plot_ccdf(ax=fig4, color='g', linestyle='--') #lognormal
fit.stretched_exponential.plot_ccdf(ax=fig4, color='b', linestyle='--') #stretched_exponential
lognormal vs powerlaw vs stretched exponential
Finally, take into account that power-law distributions in networks are currently under discussion; strongly scale-free networks seem to be empirically rare:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6399239/
Part of your problem is that you aren't accounting for the missing degrees when fitting your line. There are a small number of large-degree nodes, which you include in your fit, but you ignore the fact that many of the large degrees never occur at all. Your largest degrees are somewhere in the 1000-2000 range, but there are only 2 observations there. So really, for such large values, I'd expect the probability that a random node has one of those specific degrees to be around 2/(1000*N) (or probably even less than that). But in your fit, you're treating the probability of those two specific degrees as 2/N, and you're ignoring the degrees that don't appear.
The simple fix is to only use the smaller degrees in your fit.
The more robust way is to fit the complementary cumulative distribution. Instead of plotting P(K=k), plot P(K>=k) and try to fit that (noting that if P(K=k) follows a power law, then P(K>=k) does too, but with a different exponent - check it).
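A rough sketch of that CCDF approach, assuming the graph object from the question (the variable names are mine):
import numpy as np
import matplotlib.pyplot as plt

degrees = np.array([d for _, d in graph.degree()])
degrees = degrees[degrees > 0]

# Empirical complementary CDF: P(K >= k) for each observed degree k
k_vals = np.sort(np.unique(degrees))
ccdf = np.array([(degrees >= k).mean() for k in k_vals])

log_k, log_ccdf = np.log10(k_vals), np.log10(ccdf)
m, c = np.polyfit(log_k, log_ccdf, 1)   # slope of the CCDF on a log-log scale

plt.plot(log_k, log_ccdf, '.', label='empirical CCDF')
plt.plot(log_k, m * log_k + c, '-', label='fit, slope = %.2f' % m)
plt.legend()
plt.show()
A plain line fit on the CCDF is still only a rough diagnostic; the powerlaw package used in the other answer does this more carefully.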
Trying to fit a line to these points is problematic, because the points are not evenly distributed over the x-axis, so the line fit will give more weight to the portion of the domain that contains more points.
You should redistribute the observations over the x-axis using np.interp, like this:
logk_interp = np.linspace(np.min(logk_array),np.max(logk_array),1000)
logPk_interp = np.interp(logk_interp, logk_array, logPk_array)
plt.plot(logk_array, logPk_array,".")
m, c = np.polyfit(logk_interp, logPk_interp, 1)
plt.plot(logk_interp, m*logk_interp + c, "-")
Given some list of numbers following some arbitrary distribution, how can I define bin positions for matplotlib.pyplot.hist() so that the area in each bin is equal to (or close to) some constant area, A? The area should be calculated by multiplying the number of items in the bin by the width of the bin and its value should be no greater than A.
Here is a MWE to display a histogram with normally distributed sample data:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(100)
plt.hist(x, bin_pos)
plt.show()
Here bin_pos is a list representing the positions of the boundaries of the bins (see the related question here).
I found this question intriguing. The solution depends on whether you want to plot a density function, or a true histogram. The latter case turns out to be quite a bit more challenging. Here is more info on the difference between a histogram and a density function.
Density Functions
This will do what you want for a density function:
import numpy as np
import matplotlib.pyplot as plt

def histedges_equalN(x, nbin):
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))

x = np.random.randn(1000)
n, bins, patches = plt.hist(x, histedges_equalN(x, 10), density=True)
Note the use of density=True (normed=True in older matplotlib versions), which specifies that we're calculating and plotting a density function. In this case the areas are identically equal (you can check by looking at n * np.diff(bins)). Also note that this solution involves finding bins that all contain the same number of points.
Histograms
Here is a solution that gives approximately equal area boxes for a histogram:
def histedges_equalA(x, nbin):
    pow = 0.5
    dx = np.diff(np.sort(x))
    tmp = np.cumsum(dx ** pow)
    tmp = np.pad(tmp, (1, 0), 'constant')
    return np.interp(np.linspace(0, tmp.max(), nbin + 1),
                     tmp,
                     np.sort(x))

nbin = 10
n, bins, patches = plt.hist(x, histedges_equalA(x, nbin), density=False)
These boxes, however, are not all equal area. The first and last, in particular, tend to be about 30% larger than the others. This is an artifact of the sparse distribution of the data at the tails of the normal distribution, and I believe it will persist anytime there is a sparsely populated region in a data set.
Side note: I played with the value pow a bit, and found that a value of about 0.56 had a lower RMS error for the normal distribution. I stuck with the square-root because it performs best when the data is tightly-spaced (relative to the bin-width), and I'm pretty sure there is a theoretical basis for it that I haven't bothered to dig into (anyone?).
The issue with equal-area histograms
As far as I can tell it is not possible to obtain an exact solution to this problem. This is because it is sensitive to the discretization of the data. For example, suppose the first point in your dataset is an outlier at -13 and the next value is at -3, as depicted by the red dots in this image:
Now suppose the total "area" of your histogram is 150 and you want 10 bins. In that case the area of each histogram bar should be about 15, but you can't get there because as soon as your bar includes the second point, its area jumps from 10 to 20. That is, the data does not allow this bar to have an area between 10 and 20. One solution for this might be to adjust the lower-bound of the box to increase its area, but this starts to become arbitrary and does not work if this 'gap' is in the middle of the data set.
I'm trying to make a histogram of the radial distribution of a circular scattering of particles, and I'm trying to scale the histogram so that the radial distribution is in particles per unit area.
Disclaimer: If you don't care about the math behind what I'm talking about, just skip over this section:
I'm splitting the radial distribution into annuli of equal width, going out from the center. So, in the center, I will have a circle of some radius, a. The area of this innermost portion will be $\pi a^{2}$.
Now if we want to know the area of the annulus going from radial distance a to 2a, we do $$ \int_{a}^{2a} 2 \pi r \ dr = 3 \pi a^{2} $$
Continuing in a similar fashion (going from 2a to 3a, 3a to 4a, etc.) we see that the areas increase as follows: $$ Areas = \pi a^{2}, 3 \pi a^{2}, 5 \pi a^{2}, 7 \pi a^{2}, ... $$
So, when I weight the histogram for the radial distribution of my scatter, going out from the center, each bin will have to be weighted so that the count of first bin is left alone, the count of the second bin is divided by 3, the count of the third bin is divided by 5, etc, etc.
So: Here's my try at the code:
import numpy as np
import matplotlib.pyplot as plt
# making random sample of 100000 points between -2.5 and 2.5
y_vec = 5*np.random.random(100000) - 2.5
z_vec = 5*np.random.random(100000) - 2.5
# blank canvasses for the y, z, and radial arrays
y_vec2 = []
z_vec2 = []
R_vec = []
# number of bins I want in the ending histogram
bns = 40
# cutting out the random samplings that aren't in a circular distribution
# and making the radial array
for i in range(0, 100000):
    if np.sqrt((y_vec[i]*y_vec[i] + z_vec[i]*z_vec[i])) <= 2.5:
        y_vec2.append(y_vec[i])
        z_vec2.append(z_vec[i])
        R_vec.append(np.sqrt(y_vec[i]*y_vec[i] + z_vec[i]*z_vec[i]))
# setting up the figures and plots
fig, ax = plt.subplots()
fig2, hst = plt.subplots()
# creating a weighting array for the histogram
wghts = []
i = 0
c = 1
# making the weighting array so that each of the bins will be weighted correctly
# (splitting the radial array up evenly in to groups of the size the bins will be
# and weighting them appropriately). I assumed the because the documentation says
# the "weights" array has to be the same size as the "x" initial input, that the
# weights act on each point individually...
while i < bns:
    wghts.extend((1/c)*np.ones(len(R_vec)//bns))  # integer division so np.ones gets an int
    c = c + 2
    i = i + 1
# Making the plots
ax.scatter(y_vec2, z_vec2)
hst.hist(R_vec, bins = bns, weights = wghts)
# plotting
plt.show()
The scatter plot looks great:
But the radial plot suggests that I got the weighting wrong. It should be constant across all annuli, but it is increasing, as though it were not weighted at all:
The erratic look of the radial distribution suggests to me that the weights argument of hist weights each member of R_vec individually instead of weighting the bins.
How would I weight the bins by the factors I need to scale them by? Any help?
You are correct when you surmise that the weights weight the individual values and not the bins. This is documented:
Each value in x only contributes its associated weight towards the bin count (instead of 1).
Therefore the basic problem is that, in calculating the weights, you aren't taking account of the order of the points. You created points at random, but then you create the weights in sequence from greatest to least. This means you're not assigning the right weights to the right points.
The way you should create the weights is by directly computing each point's weight from its radius. The way you seem to want to do this is by discretizing the radius into a binned radius, then weighting inversely by that. Instead of what you're doing for the weights, try this:
R_vec = np.array(R_vec)
wghts = 1 / (2*(R_vec//(2.5/bns))+1)
This gives me the right result:
You can also get essentially the same result without doing the binning in the weighting --- that is, just directly weight each point by the reciprocal of its radius:
R_vec = np.array(R_vec)
wghts = 1 / R_vec
The advantage of doing this is that you can then plot a histogram with a different number of bins without recomputing the weights. It also makes somewhat more conceptual sense to weight each point by how far out it is in a continuous sense, not by whether it falls on one side or the other of a discrete bin boundary.
When you want to plot something "per unit area", use area as your independent variable.
This way, you can still use a histogram if you like, but you don't have to worry about non-uniform binning or weighting.
I replaced your line:
hst.hist(R_vec, bins = bns, weights = wghts)
with:
hst.hist(np.pi*np.square(R_vec),bins=bns)