I was given this assignment, and I'm not sure if I understand the question correctly.
We considered the sample-mean estimator for the distribution mean. Another estimator for the distribution mean is the min-max-mean estimator that takes the mean (average) of the smallest and largest observed values. For example, for the sample {1, 2, 1, 5, 1}, the sample mean is (1+2+1+5+1)/5=2 while the min-max-mean is (1+5)/2=3. In this problem we ask you to run a simulation that approximates the mean squared error (MSE) of the two estimators for a uniform distribution.
Take a continuous uniform distribution between a and b - given as parameters. Draw a 10-observation sample from this distribution, and calculate the sample-mean and the min-max-mean. Repeat the experiment 100,000 times, and for each estimator calculate its average bias as your MSE estimates.
Sample Input: Sample_Mean_MSE(1, 5)
Sample Output: 0.1343368663225577
The code below is my attempt to:
Draw a sample of size 10 from a uniform distribution between a and b
Calculate the MSE, with the mean calculated by the sample-mean method
Repeat this 100,000 times and store the resulting MSEs in an array
Return the mean of the MSE array as the final result
However, the result I get is quite far from the sample output above.
Can someone clarify the assignment, around the part "Repeat the experiment 100,000 times, and for each estimator calculate its average bias as your MSE estimates"? Thanks
import numpy as np

def Sample_Mean_MSE(a, b):
    # inputs: bounds for uniform distribution a and b
    # sample size is 10
    # number of experiments is 100,000
    # output: MSE for sample mean estimator with sample size 10
    mse_s = np.array([])
    k = 0
    while k in range(100000):
        sample = np.random.randint(low=a, high=b, size=10)
        squared_errors = np.array([])
        for i, value in enumerate(sample):
            error = value - sample.mean()
            squared_errors = np.append(squared_errors, error ** 2)
        k += 1
        mse_s = np.append(mse_s, squared_errors.mean())
    return mse_s.mean()

print(Sample_Mean_MSE(1, 5))
To get the expected result we first need to understand what the Mean squared error (MSE) of an estimator is. Take the sample-mean estimator for example (min-max-mean estimator is basically the same):
MSE assesses the average squared difference between the observed and predicted values - in this case, between the distribution mean and the sample mean. We can break it down as below:
Draw a sample of size 10 and calculate the sample mean (ŷ), repeat 100,000 times
Calculate the distribution mean: y = (a + b)/2
Calculate and return the MSE: MSE = 1/n * Σ(y - ŷ)^2
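Putting those steps together, here is a minimal sketch of the simulation (the function name estimator_mse and the use of np.random.uniform are my own choices, not part of the assignment); it approximates the MSE of both estimators:
import numpy as np

def estimator_mse(a, b, n=10, trials=100000):
    # approximate the MSE of the sample-mean and min-max-mean estimators
    # for a continuous Uniform(a, b) distribution
    true_mean = (a + b) / 2
    se_sample_mean = np.empty(trials)
    se_minmax_mean = np.empty(trials)
    for k in range(trials):
        sample = np.random.uniform(a, b, size=n)
        se_sample_mean[k] = (sample.mean() - true_mean) ** 2
        se_minmax_mean[k] = ((sample.min() + sample.max()) / 2 - true_mean) ** 2
    return se_sample_mean.mean(), se_minmax_mean.mean()

print(estimator_mse(1, 5))
For a = 1 and b = 5 the sample-mean MSE should come out near (b - a)^2/120 ≈ 0.133, which is consistent with the expected output of about 0.134.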
I'm looking for a function which calculates the n-th central moment (the same as the one from scipy.stats.moment) for my binned data (the output of the numpy.histogram function).
# Generate normally distributed data
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(size=500, loc=1, scale=2)
H = np.histogram(data, bins=50)
plt.scatter(H[1][:-1], H[0])
plt.show()
For my code example above, the first four central moments should come out as roughly (0, 4, 0, 48), since sigma = 2 there.
Working with binned data is essentially the same as working with weighted data: one uses the midpoint of each bin as a data point and the count of that bin as its weight. If scipy.stats.moment supported weights, we could do this computation directly. As it is, we can use numpy.average, which supports weights.
midpoints = 0.5 * (H[1][1:] + H[1][:-1])
ev = np.average(midpoints, weights=H[0])
print(ev)
for k in range(2, 5):
    print(np.average((midpoints - ev)**k, weights=H[0]))
Output (obviously random):
1.08242834443
4.21602099286
0.713129264647
51.6257736139
I didn't print the centered 1st moment (which is 0 by construction), printing the expected value instead. Theoretically*, these are 1, 4, 0, 48, but for any given sample there is going to be some deviation from the parameters of the distribution.
(*) Not exactly. In the formula for variance I didn't include the correction factor n/(n-1) (where n is the total size of data set, i.e., the sum of weights). This factor adjusts the sample variance so it becomes an unbiased estimator of the population variance. You can include it if you like. Similar adjustments are probably needed for higher-order moments (if the goal is to have unbiased estimators), but I'd have to look this up, and in any case this is not a statistics site.
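If you do want that correction, here is a minimal sketch of the adjustment for the second moment, assuming the weights H[0] are plain integer counts:
n = H[0].sum()                                    # total number of observations
biased_var = np.average((midpoints - ev)**2, weights=H[0])
unbiased_var = biased_var * n / (n - 1)           # Bessel's correction
print(unbiased_var)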
Not sure if this belongs in statistics, but I am trying to use Python to achieve this. I essentially just have a list of integers:
data = [300,244,543,1011,300,125,300 ... ]
And I would like to know the probability of a value occurring given this data.
I graphed histograms of the data using matplotlib and obtained these:
In the first graph, the numbers represent the number of characters in a sequence. In the second graph, it's a measured amount of time in milliseconds. The minimum is greater than zero, but there isn't necessarily a maximum. The graphs were created using millions of examples, but I'm not sure I can make any other assumptions about the distribution. I want to know the probability of a new value given that I have a few million examples of values. In the first graph, for instance, I have a few million sequences of different lengths and would like to know the probability of a length of 200.
I know that for a continuous distribution the probability of any exact point is supposed to be zero, but given a stream of new values, I need to be able to say how likely each value is. I've looked through some of the numpy/scipy probability density functions, but I'm not sure which to choose or how to query for new values once I run something like scipy.stats.norm.pdf(data). It seems like different probability density functions will fit the data differently, and given the shape of the histograms I'm not sure how to decide which to use.
Since you don't seem to have a specific distribution in mind, but you might have a lot of data samples, I suggest using a non-parametric density estimation method. One of the data types you describe (time in ms) is clearly continuous, and one method for non-parametric estimation of a probability density function (PDF) for continuous random variables is the histogram that you already mentioned. However, as you will see below, Kernel Density Estimation (KDE) can be better. The second type of data you describe (number of characters in a sequence) is of the discrete kind. Here, kernel density estimation can also be useful and can be seen as a smoothing technique for the situations where you don't have a sufficient amount of samples for all values of the discrete variable.
Estimating Density
The example below shows how to first generate data samples from a mixture of 2 Gaussian distributions and then apply kernel density estimation to find the probability density function:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.neighbors import KernelDensity
# Generate random samples from a mixture of 2 Gaussians
# with modes at 5 and 10
data = np.concatenate((5 + np.random.randn(10, 1),
                       10 + np.random.randn(30, 1)))
# Plot the true distribution
x = np.linspace(0, 16, 1000)[:, np.newaxis]
norm_vals = norm.pdf(x, 5, 1) * 0.25 + norm.pdf(x, 10, 1) * 0.75
plt.plot(x, norm_vals)
# Plot the data using a normalized histogram
plt.hist(data, 50, density=True)
# Do kernel density estimation
kd = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(data)
# Plot the estimated density
kd_vals = np.exp(kd.score_samples(x))
plt.plot(x, kd_vals)
# Show the plots
plt.show()
This will produce the following plot, where the true distribution is shown in blue, the histogram is shown in green, and the PDF estimated using KDE is shown in red:
As you can see, in this situation the PDF approximated by the histogram is not very useful, while KDE provides a much better estimate. However, with a larger number of data samples and a proper choice of bin size, the histogram might produce a good estimate as well.
The parameters you can tune in case of KDE are the kernel and the bandwidth. You can think of the kernel as the building block for the estimated PDF, and several kernel functions are available in Scikit Learn: gaussian, tophat, epanechnikov, exponential, linear, cosine. Changing the bandwidth lets you adjust the bias-variance trade-off: a larger bandwidth gives a smoother estimate with more bias but less variance, which helps when you have fewer data samples, while a smaller bandwidth reduces bias but increases variance (fewer samples contribute to each local estimate) and gives a better estimate when more samples are available.
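If you would rather not pick the bandwidth by hand, one common approach (not part of the example above; the parameter grid here is only illustrative) is a small cross-validated grid search, since KernelDensity exposes a log-likelihood score:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity
# cross-validated search for a reasonable bandwidth;
# `data` and `np` come from the example above
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.linspace(0.1, 2.0, 20)},
                    cv=5)
grid.fit(data)
print(grid.best_params_['bandwidth'])
kd = grid.best_estimator_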
Calculating Probability
For a PDF, probability is obtained by integrating the density over a range of values. As you noticed, that leads to a probability of 0 for any specific value.
Scikit Learn does not seem to have a built-in function for calculating probability. However, it is easy to estimate the integral of the PDF over a range: evaluate the PDF at multiple points within the range and sum the obtained values multiplied by the step size between the evaluation points. In the example below, N evaluation points are spaced step apart.
# Get probability for range of values
start = 5 # Start of the range
end = 6 # End of the range
N = 100 # Number of evaluation points
step = (end - start) / (N - 1) # Step size
x = np.linspace(start, end, N)[:, np.newaxis] # Generate values in the range
kd_vals = np.exp(kd.score_samples(x)) # Get PDF values for each x
probability = np.sum(kd_vals * step) # Approximate the integral of the PDF
print(probability)
Please note that kd.score_samples generates log-likelihood of the data samples. Therefore, np.exp is needed to obtain likelihood.
The same computation can be performed with SciPy's built-in integration routines, which will give a slightly more accurate result:
from scipy.integrate import quad
# score_samples expects a 2D array, hence the [[x]]
probability = quad(lambda x: np.exp(kd.score_samples(np.array([[x]])))[0], start, end)[0]
For instance, for one run, the first method calculated the probability as 0.0859024655305, while the second method produced 0.0850974209996139.
OK, I offer this as a starting point, but estimating densities is a very broad topic. For your case involving the number of characters in a sequence, we can model it from a straightforward frequentist perspective using empirical probability. Here, probability is essentially a generalization of the concept of percentage. In our model, the sample space is discrete: all positive integers. You simply count the occurrences and divide by the total number of events to get your estimate of the probabilities. Anywhere we have zero observations, our estimate for the probability is zero.
>>> samples = [1,1,2,3,2,2,7,8,3,4,1,1,2,6,5,4,8,9,4,3]
>>> from collections import Counter
>>> counts = Counter(samples)
>>> counts
Counter({1: 4, 2: 4, 3: 3, 4: 3, 8: 2, 5: 1, 6: 1, 7: 1, 9: 1})
>>> total = sum(counts.values())
>>> total
20
>>> probability_mass = {k:v/total for k,v in counts.items()}
>>> probability_mass
{1: 0.2, 2: 0.2, 3: 0.15, 4: 0.15, 5: 0.05, 6: 0.05, 7: 0.05, 8: 0.1, 9: 0.05}
>>> probability_mass.get(2,0)
0.2
>>> probability_mass.get(12,0)
0
Now, for your timing data, it is more natural to model this as a continuous distribution. Instead of using a parametric approach where you assume that your data has some distribution and then fit that distribution to your data, you should take a non-parametric approach. One straightforward way is to use a kernel density estimate. You can simply think of this as a way of smoothing a histogram to give you a continuous probability density function. There are several libraries available. Perhaps the most straightforward for univariate data is scipy's:
>>> import scipy.stats
>>> kde = scipy.stats.gaussian_kde(samples)
>>> kde.pdf(2)
array([ 0.15086911])
To get the probability of an observation in some interval:
>>> kde.integrate_box_1d(1,2)
0.13855869478828692
Here is one possible solution. You count the number of occurrences of each value in the original list. The future probability of a given value is its past rate of occurrence, which is simply the number of past occurrences divided by the length of the original list. In Python it's very simple:
# x is the given list of values
from collections import Counter
c = Counter(x)

def probability(a):
    # returns the probability of a given number a
    return float(c[a]) / len(x)
I have a question:
Given mean and variance I want to calculate the probability of a sample using a normal distribution as probability basis.
The numbers are:
mean = -0.546369
var = 0.006443
curr_sample = -0.466102
prob = 1/(np.sqrt(2*np.pi*var))*np.exp( -( ((curr_sample - mean)**2)/(2*var) ) )
I get a probability which is larger than 1! I get prob = 3.014558...
What is causing this? Does the fact that the variance is so small mess something up? It's a perfectly legal input to the formula and should give something small, not greater than 1! Any suggestions?
Ok, what you compute is not a probability, but a probability density (which may be larger than one). In order to get 1 you have to integrate over the normal distribution like so:
import numpy as np
mean = -0.546369
var = 0.006443
curr_sample = np.linspace(-10, 10, 10000)
prob = np.sum(1/(np.sqrt(2*np.pi*var)) * np.exp(-((curr_sample - mean)**2)/(2*var)) * (curr_sample[1] - curr_sample[0]))
print(prob)
which results in
0.99999999999961509
The formula you give is a probability density, not a probability. The density formula is such that when you integrate it between two values of x, you get the probability of being in that interval. However, this means that the probability of getting any particular sample is, in fact, 0 (it's the density times the infinitesimally small dx).
So what are you actually trying to calculate? You probably want something like the probability of getting your value or larger, the so-called tail probability, which is often used in statistics (it so happens that this is given by the error function when you're talking about a normal distribution, although you need to be careful of exactly how it's defined).
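For example, here is a hedged sketch of how you could get that tail probability (and, for comparison, the probability of a small interval around the sample) with scipy.stats.norm, using the numbers from the question; the 0.01 half-width of the interval is an arbitrary choice:
import numpy as np
from scipy.stats import norm

mean = -0.546369
var = 0.006443
curr_sample = -0.466102
sd = np.sqrt(var)

tail_prob = norm.sf(curr_sample, loc=mean, scale=sd)  # P(X >= curr_sample), the survival function
interval_prob = (norm.cdf(curr_sample + 0.01, loc=mean, scale=sd)
                 - norm.cdf(curr_sample - 0.01, loc=mean, scale=sd))  # P(|X - curr_sample| < 0.01)
print(tail_prob, interval_prob)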
When considering the bell-shaped probability density function (PDF) with a given mean and variance, the peak of the curve (the height at the mode) is 1/sqrt(2*pi*var). For the standard normal distribution (mean 0, var 1) this is 1/sqrt(2*pi) ≈ 0.4, but whenever var < 1/(2*pi) the peak exceeds 1. With var = 0.006443 the peak is about 4.97, so obtaining a density value of 3.01 when evaluating a general normal PDF at a specific point is entirely possible.
I am trying to find the power spectral density of a signal measured at uneven times. The data looks something like this:
0 1.55
755 1.58
2412256 2.42
2413137 0.32
2497761 1.19
...
where the first column is the time since the first measurement (in seconds) and the second column is the value of the measurement.
Currently, using the periodogram function in Matlab, I have been able to estimate the power spectral density by using:
nfft = length(data(:,2));
pxx = periodogram(data(:,2),[],nfft);
Now at the moment, to plot this I have been using
len = length(pxx);
num = 1:1:len;
plot(num,pxx)
This clearly does not put the correct x-axis on the power spectral density (it yields something like the plot below), which needs to be in frequency space. I am confused about how to go about this given the uneven sampling of the data.
What is the correct way to convert to (and then plot in) frequency space when estimating the power spectral density for data that has been unevenly sampled? I am also interested in tackling this from a python/numpy/scipy perspective but have so far only looked at the Matlab function.
I am not aware of any functions that calculate a PSD from irregularly sampled data, so you need to convert the data to a uniform sample rate first. The first step is to use interp1 to resample at regular time intervals.
avg_fs = 1/mean(diff(data(:, 1)));
min_time = min(data(:, 1));
max_time = max(data(:, 1));
num_pts = floor((max_time - min_time) * avg_fs);
new_time = (1:num_pts)' / avg_fs;
new_time = new_time - new_time(1) + min_time;
new_x = interp1(data(:, 1), data(:, 2), new_time);
I always use pwelch for calculating PSDs; here is how I would go about it:
nfft = 512; % play with this to change your frequency resolution
noverlap = round(nfft * 0.75); % 75% overlap
window = hanning(nfft);
[Pxx,F] = pwelch(new_x, window, noverlap, nfft, avg_fs);
plot(F, Pxx)
xlabel('Frequency (Hz)')
grid on
You will definitely want to experiment with nfft, larger numbers will give you more frequency resolution (smaller spacing between frequencies), but the PSD will be noisier. One trick you can do to get fine resolution and low noise is to make the window smaller than nfft.
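Since the question also mentions a Python/NumPy/SciPy angle, here is a hedged sketch of the same resample-then-Welch approach, assuming data is an (N, 2) NumPy array with the time in seconds in the first column and the measurement in the second:
import numpy as np
from scipy.signal import welch
import matplotlib.pyplot as plt

t, x = data[:, 0], data[:, 1]
avg_fs = 1.0 / np.mean(np.diff(t))             # average sample rate
new_t = np.arange(t[0], t[-1], 1.0 / avg_fs)   # uniform time grid
new_x = np.interp(new_t, t, x)                 # linear resampling, like interp1

nfft = 512                                     # play with this to change frequency resolution
f, Pxx = welch(new_x, fs=avg_fs, window='hann',
               nperseg=nfft, noverlap=round(0.75 * nfft))
plt.semilogy(f, Pxx)
plt.xlabel('Frequency (Hz)')
plt.ylabel('PSD')
plt.grid(True)
plt.show()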