I have been trying to calculate an autocorrelation function, as defined in statistical mechanics, using numpy. Most of the documentation I found is relative to functions like correlate and convolve. However, for a given random variable x these functions just seem to calculate the sum
ACF(dt) = sum_{t=0}^T [(x(t)*x(t+dt)]
instead of the average
ACF(dt) = mean[x(t)*x(t+dt)]
so in fact for calculating an autocorrelation function one would need to do something like:
acf = np.correlate(x,x,mode='full')
acf_half = acf[acf.size / 2:]
ldata = len(acf)
acf = np.array([x/(ldata-i) for i,x in enumerate(acf_half)])
Of course we would need to subtract mean(x)**2 from the resulting acf to be correct.
Can anyone confirm that this is correct?
Generally speaking, the autocorrelation, correlation, etc. is the sum (integral). Sometimes it is normalized, but not averaged in the sense as you've written above. This is because they are defined in terms of the mathematical convolution operation, which is simply the integral that you've written as a sum above.
The brackets at the stat mech page indicate a thermal average, which is an ensemble or time average over the 'experiment' taking place many times at many different states at some temperature. This (the finite temperature) causes the fluctuations that give rise to the 'statistical' nature of the problem, and cause the decay of the correlation (loss of long range order). This simply means that you should find the autocorrelation of several datasets, and average those together, but do not take the mean of the function.
As far as I can tell, your code is attempting to weigh the correlation at dt by the length of the overlap length dt, but I do not believe that this is correct.
With respect to the subtraction of <s>2, that's in the case of the spin model, where <s> would be the mean spin (magnetization), so I believe you are correct in that you should use mean(x)**2.
As a side-note, I would suggest using mode='same' instead of 'full' so that the domain of your correlation matches the domain of your input without having to look at just one-half of the output (here the output is symmetric, so it doesn't really make a difference).
Related
I have the a dataframe which includes heights. The data can not go below zero. That's why i can not use standard deviation as this data is not a normal distribution. I can not use 68-95-99.7 rule here because it fails in my case. Here is my dataframe, mean and SD.
0.77132064
0.02075195
0.63364823
0.74880388
0.49850701
0.22479665
0.19806286
0.76053071
0.16911084
0.08833981
Mean: 0.41138725956196015
Std: 0.2860541519582141
If I get 2 std, as you can see the number becomes negative.
-2 x std calculation = 0.41138725956196015 - 0.2860541519582141 x 2 = -0,160721044354468
I have tried using percentile and not satisfied with it to be honest. How can i apply Chebyshev's inequality to this problem? Here what i did so far:
np.polynomial.Chebyshev(df['Heights'])
But this returns numbers not a SD level i can measure. Or do you think Chebyshev is the best choice in my case?
Expected solution:
I am expecting to get a range like 75% next height will be between 0.40 - 0.43 etc.
EDIT1: Added histogram
To be more clear, I have added my real data's histogram
EDIT2: Some values from real data
Mean: 0.007041500928135767
Percentile 50: 0.0052000000000000934
Percentile 90: 0.015500000000000047
Std: 0.0063790857035425025
Var: 4.06873389299246e-05
Thanks a lot
You seem to be confusing two ideas from the same mathematician, Chebyshev. These ideas are not the same.
Chebysev's inequality states a fact that is true for many probability distributions. For two standard deviations, it states that three-fourths of the data items will lie within two standard deviations from the mean. As you state, for normal distributions about 19/20 of the items will lie in that interval, but Chebyshev's inequality is an absolute bound that is met by practically all distributions. The fact that your data values are never negative does not change the truth of the inequality; it just makes the actual proportion of values in the interval even larger, so the inequality is even more true (in a sense).
Chebyshev polynomials do not involve statistics, but are simply a series (or two series) of polynomials, commonly used in calculating approximations for computer functions. That is what np.polynomial.Chebyshev involves, and therefore does not seem useful to you at all.
So calculate Chebyshev's inequality yourself. There is no need for a special function for that, since it is so easy (this is Python 3 code):
def Chebyshev_inequality(num_std_deviations):
return 1 - 1 / num_std_deviations**2
You can change that to handle the case where k <= 1 but the idea is obvious.
In your particular case: the inequality says that at least 3/4, or 75%, of the data items will lie within 2 standard deviations of the mean, which means more than 0.41138725956196015 - 2 * 0.2860541519582141 and less than than 0.41138725956196015 + 2 * 0.2860541519582141 (note the different signs), which simplifies to the interval
[-0.16072104435446805, 0.9834955634783884]
In your data, 100% of your data values are in that interval, so Chebyshev's inequality was correct (of course).
Now, if your goal is to predict or estimate where a certain percentile is, Chebyshev's inequality does not help much. It is an absolute lower bound, so it gives one limit to a percentile. For example, by what we did above we know that the 12.5'th percentile is at or above -0.16072104435446805 and the 87.5'th percentile is at or below 0.9834955634783884. Those facts are true but are probably not what you want. If you want an estimate that is closer to the actual percentile, this is not the way to go. The 68-95-99.7 rule is an estimate--the actual locations may be higher or lower, but if the distribution is normal than the estimate will not be far off. Chebyshev's inequality does not do that kind of estimate.
If you want to estimate the 12.5'th and 87.5'th percentiles (showing where 75 percent of all the population will fall) you should calculate those percentiles of your sample and use those values. If you don't know more details about the kind of distribution you have, I don't see any better way. There are reasons why normal distributions are so popular!
It sounds like you want the boundaries for the middle 75% of your data.
The middle 75% of the data is between the 12.5th percentile and the 87.5th percentile, so you can use the quantile function to get the values at the locations:
[df['Heights'].quantile(0.5 - 0.75/2), df['Heights'].quantile(0.5 + 0.75/2)]
#[0.09843618875, 0.75906485625]
As per What does it mean when the standard deviation is higher than the mean? What does that tell you about the data? - Quora, SD is a measure of "spread" and mean is a measure of "position". As you can see, these are more or less independent things. Now, if all your samples are positive, SD cannot be greater than the mean because of the way it's calculated, but 2 or 3 SDs very well can.
So, basically, SD being roughly equal to the mean means that your data are all over the place.
Now, a random variable that's strictly positive indeed cannot be normally distributed. But for a rough estimation, seeing that you still have a bell shape, we can pretend it is and still use SD as a rough measure of the spread (though, since 2 and 3 SD can go into negatives, they lack any physical meaning here whatsoever and so are unusable for the sake of our pretention):
E.g. to get a rough prediction of grass growth, you can still take the mean and apply whatever growth model you're using to it -- that will get the new, prospective mean. Then applying the same to mean±SD will give an idea of the new SD.
This is very rough, of course. But to get any better, you'll have to somehow check which distribution you're dealing with and use its peak and spread characteristics instead of mean and SD. And in any case, your prediction will not be any better than your growth model -- studies of which are anything but conclusive judging by e.g. https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1365-3040.2005.01490.x (not a single formula there).
Given a list of values:
>>> from scipy import stats
>>> import numpy as np
>>> x = list(range(100))
Using student t-test, I can find the confidence interval of the distribution at the mean with an alpha of 0.1 (i.e. at 90% confidence) with:
def confidence_interval(alist, v, itv):
return stats.t.interval(itv, df=len(alist)-1, loc=v, scale=stats.sem(alist))
x = list(range(100))
confidence_interval(x, np.mean(x), 0.1)
[out]:
(49.134501289005009, 49.865498710994991)
But if I were to find the confidence interval at every datapoint, e.g. for the value 10:
>>> confidence_interval(x, 10, 0.1)
(9.6345012890050086, 10.365498710994991)
How should the interval of the values be interpreted? Is it statistically/mathematical sound to interpret that at all?
Does it goes something like:
At 90% confidence, we know that the data point 10 falls in the interval (9.6345012890050086, 10.365498710994991),
aka.
At 90% confidence, we can say that the data point falls at 10 +- 0.365...
So can we interpret the interval as some sort of a box plot of the datapoint?
In short
Your call gives the interval of confidence for the mean parameter of a normal law of unknown parameters of which you observed 100 observations with an average of 10 and a stdv of 29. It is furthermore not sound to interpret it, since your distribution is clearly not normal, and because 10 is not the observed mean.
TL;DR
There are a lot misconceptions floating around confidence intervals, most of which seemingly stems from a misunderstanding of what we are confident about. Since there is some confusion in your understanding of confidence interval maybe a broader explanation will give a deeper understanding of the concepts you are handling, and hopefully definitely rule out any source of error.
Clearing out misconceptions
Very briefly to set things up. We are in a situation where we want to estimate a parameter, or rather, we want to test a hypothesis for the value of a parameter parameterizing the distribution of a random variable. e.g: Let's say I have a normally distributed variable X with mean m and standard deviation sigma, and I want to test the hypothesis m=0.
What is a parametric test
This a process for testing a hypothesis on a parameter for a random variable. Since we only have access to observations which are concrete realizations of the random variable, it generally procedes by computing a statistic of these realizations. A statistic is roughly a function of the realizations of a random variable. Let's call this function S, we can compute S on x_1,...,x_n which are as many realizations of X.
Therefore you understand that S(X) is a random variable as well with distribution, parameters and so on! The idea is that for standard tests, S(X) follows a very well known distribution for which values are tabulated. e.g: http://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf
What is a confidence interval?
Given what we've just said, a definition for a confidence interval would be: the range of values for the tested parameter, such that if the observations were to have been generated from a distribution parametrized by a value in that range, it would not have probabilistically improbable.
In other words, a confidence interval gives an answer to the question: given the following observations x_1,...,x_n n realizations of X, can we confidently say that X's distribution is parametrized by such value. 90%, 95%, etc... asserts the level of confidence. Usually, external constraints fix this level (industrial norms for quality assessment, scientific norms e.g: for the discovery of new particles).
I think it is now intuitive to you that:
The higher the confidence level, the larger the confidence interval. e.g. for a confidence of 100% the confidence interval would range across all the possible values as soon as there is some uncertainty
For most tests, under conditions I won't describe, the more observations we have, the more we can restrain the confidence interval.
At 90% confidence, we know that the data point 10 falls in the interval (9.6345012890050086, 10.365498710994991)
It is wrong to say that and it is the most common source of mistakes. A 90% confidence interval never means that the estimated parameter has 90% percent chance of falling into that interval. When the interval is computed, it covers the parameter or it does not, it is not a matter of probability anymore. 90% is an assessment of the reliability of the estimation procedure.
What is a student test?
Now let's come to your example and look at it under the lights of what we've just said. You to apply a Student test to your list of observations.
First: a Student test aims at testing a hypothesis of equality between the mean m of a normally distributed random variable with unknown standard deviation, and a certain value m_0.
The statistic associated with this test is t = (np.mean(x) - m_0)/(s/sqrt(n)) where x is your vector of observations, n the number of observations and s the empirical standard deviation. With no surprise, this follows a Student distribution.
Hence, what you want to do is:
compute this statistic for your sample, compute the confidence interval associated with a Student distribution with this many degrees of liberty, this theoretical mean, and confidence level
see if your computed t falls into that interval, which tells you if you can rule out the equality hypothesis with such level of confidence.
I wanted to give you an exercise but I think I've been lengthy enough.
To conclude on the use of scipy.stats.t.interval. You can use it one of two ways. Either computing yourself the t statistic with the formula shown above and check if t fits in the interval returned by interval(alpha, df) where df is the length of your sampling. Or you can directly call interval(alpha, df, loc=m, scale=s) where m is your empirical mean, and s the empirical standard deviatation (divided by sqrt(n)). In such case, the returned interval will directly be the confidence interval for the mean.
So in your case your call gives the interval of confidence for the mean parameter of a normal law of unknown parameters of which you observed 100 observations with an average of 10 and a stdv of 29. It is furthermore not sound to interpret it, beside the error of interpretation I've already pointed out, since your distribution is clearly not normal, and because 10 is not the observed mean.
Resources
You can check out the following resources to go further.
wikipedia links to have quick references and an elborated overview
https://en.wikipedia.org/wiki/Confidence_interval
https://en.wikipedia.org/wiki/Student%27s_t-test
https://en.wikipedia.org/wiki/Student%27s_t-distribution
To go further
http://osp.mans.edu.eg/tmahdy/papers_of_month/0706_statistical.pdf
I haven't read it but the one below seems quite good.
https://web.williams.edu/Mathematics/sjmiller/public_html/BrownClasses/162/Handouts/StatsTests04.pdf
You should also check out p-values, you will find a lot of similarities and hopefully you understand them better after reading this post.
https://en.wikipedia.org/wiki/P-value#Definition_and_interpretation
Confidence intervals are hopelessly counter-intuitive. Especially for programmers, I dare say as a programmer.
Wikipedida uses a 90% confidence to illustrate a possible interpretation:
Were this procedure to be repeated on numerous samples, the fraction of calculated confidence intervals (which would differ for each sample) that encompass the true population parameter would tend toward 90%.
In other words
The confidence interval provides information about a statistical parameter (such as the mean) of a sample.
The interpretation of e.g. a 90% confidence interval would be: If you repeat the experiment an infinite number of times 90% of the resulting confidence intervals will contain the true parameter.
Assuming the code to compute the interval is correct (which I have not checked) you can use it to calculate the confidence interval of the mean (because of the t-distribution, which models the sample mean of a normally distributed population with unknown standard deviation).
For practical purposes it makes sense to pass in the sample mean. Otherwise you are saying "if I pretended my data had a sample mean of e.g. 10, the confidence interval of the mean would be [9.6, 10.3]".
The particular data passed into the confidence interval does not make sense either. Numbers increasing in a range from 0 to 99 are very unlikely to be drawn from a normal distribution.
I'm working with extremely noisy data occasionally peppered with outliers, so I'm relying mostly on correlation as a measure of accuracy in my NN.
Is it possible to explictly use something like rank correlation (the Spearman correlation coefficient) as my cost function? Up to now, I've relied mostly on MSE as a proxy for correlation.
I have three major stumbling blocks right now:
1) The notion of ranking becomes much fuzzier with mini-batches.
2) How do you dynamically perform rankings? Will TensorFlow not have a gradient error/be unable to track how a change in a weight/bias affects the cost?
3) How do you determine the size of the tensors you're looking at during runtime?
For example, the code below is what I'd like to roughly do if I were to just use correlation. In practice, length needs to be passed in rather than determined at runtime.
length = tf.shape(x)[1] ## Example code. This line not meant to work.
original_loss = -1 * length * tf.reduce_sum(tf.mul(x, y)) - (tf.reduce_sum(x) * tf.reduce_sum(y))
divisor = tf.sqrt(
(length * tf.reduce_sum(tf.square(x)) - tf.square(tf.reduce_sum(x))) *
(length * tf.reduce_sum(tf.square(y)) - tf.square(tf.reduce_sum(y)))
)
original_loss = tf.truediv(original_loss, divisor)
Here is the code for the Spearman correlation:
predictions_rank = tf.nn.top_k(predictions_batch, k=samples, sorted=True, name='prediction_rank').indices
real_rank = tf.nn.top_k(real_outputs_batch, k=samples, sorted=True, name='real_rank').indices
rank_diffs = predictions_rank - real_rank
rank_diffs_squared_sum = tf.reduce_sum(rank_diffs * rank_diffs)
six = tf.constant(6)
one = tf.constant(1.0)
numerator = tf.cast(six * rank_diffs_squared_sum, dtype=tf.float32)
divider = tf.cast(samples * samples * samples - samples, dtype=tf.float32)
spearman_batch = one - numerator / divider
The problem with the Spearman correlation is that you need to use a sorting algorithm (top_k in my code). And there is no way to translate it to a loss value. There is no derivade of a sorting algorithm. You can use a normal correlation but I think there is no mathematically difference to use the mean squared error.
I am working on this right now for images. What I have read in papers that they use to add the ranking into the loss function is to compare 2 or 3 images (where I say images you can say anything you want to rank).
Comparing two elements:
Where N is the total number of elements and α a margin value. I got this equation from Photo Aesthetics Ranking Network with Attributes and Content Adaptation
You can also use losses with 3 elemens where you compare two of them with similar ranking with another one with a different one:
But in this equation you also need to add the direction of the ranking, more details in Will People Like Your Image?. In the case of this paper they use a vector encodig instead of a real value but you can do it for just a number too.
In the case of images, the comparison between images makes more sense when those images are related. So it is a good idea to run a clustering algorithm to create (maybe?) 10 clusters, so you can use elements of the same cluster to make comparisons instead of very different things. This will help the network as the inputs are related somehow and not completely different.
As a side note you should know what is more important for you, if it is the final rank order or the rank value. If it is the value you should go with mean square error, if it is the rank order you can use the losses I wrote before. Or you can even combine them.
How do you determine the size of the tensors you're looking at during runtime?
tf.shape(tensor) returns a tensor with the shape. Then you can use tf.gather(tensor,index) to get the value you want.
I have a set of measured radii (t+epsilon+error) at an equally spaced angles.
The model is circle of radius (R) with center at (r, Alpha) with added small noise and some random error values which are much bigger than noise.
The problem is to find the center of the circle model (r,Alpha) and the radius of the circle (R). But it should not be too much sensitive to random error (in below data points at 7 and 14).
Some radii could be missing therefore the simple mean would not work here.
I tried least square optimization but it significantly reacts on error.
Is there a way to optimize least deltas but not the least squares of delta in Python?
Model:
n=36
R=100
r=10
Alpha=2*Pi/6
Data points:
[95.85, 92.66, 94.14, 90.56, 88.08, 87.63, 88.12, 152.92, 90.75, 90.73, 93.93, 92.66, 92.67, 97.24, 65.40, 97.67, 103.66, 104.43, 105.25, 106.17, 105.01, 108.52, 109.33, 108.17, 107.10, 106.93, 111.25, 109.99, 107.23, 107.18, 108.30, 101.81, 99.47, 97.97, 96.05, 95.29]
It seems like your main problem here is going to be removing outliers. There are a couple of ways to do this, but for your application, your best bet is to probably just to remove items based on their distance from the median (Since the median is much less sensitive to outliers than the mean.)
If you're using numpy that would looks like this:
def remove_outliers(data_points, margin=1.5):
nd = np.abs(data_points - np.median(data_points))
s = nd/np.median(nd)
return data_points[s<margin]
After which you should run least squares.
If you're not using numpy you can do something similar with native python lists:
def median(points):
return sorted(points)[len(points)/2] # evaluates to an int in python2
def remove_outliers(data_points, margin=1.5):
m = median(data_points)
centered_points = [abs(point - m) for point in data_points]
centered_median = median(centered_points)
ratios = [datum/centered_median for datum in centered_points]
return [point for i, point in enumerate(data_points) if ratios[i]>margin]
If you're looking to just not count outliers as highly you can just calculate the mean of your dataset, which is just a linear equivalent of the least-squares optimization.
If you're looking for something a little better I might suggest throwing your data through some kind of low pass filter, but I don't think that's really needed here.
A low-pass filter would probably be the best, which you can do as follows: (Note, alpha is a number you will have to fiddle with to get your desired output.)
def low_pass(data, alpha):
new_data = [data[0]]
for i in range(1, len(data)):
new_data.append(alpha * data[i] + (1 - alpha) * new_data[i-1])
return new_data
At which point your least squares optimization should work fine.
Replying to your final question
Is there a way to optimize least deltas but not the least squares of delta in Python?
Yes, pick an optimization method (for example downhill simplex implemented in scipy.optimize.fmin) and use the sum of absolute deviations as a merit function. Your dataset is small, I suppose that any general purpose optimization method will converge quickly. (In case of non-linear least squares fitting it is also possible to use general purpose optimization algorithm, but it's more common to use the Levenberg-Marquardt algorithm which minimizes sums of squares.)
If you are interested when minimizing absolute deviations instead of squares has theoretical justification see Numerical Recipes, chapter Robust Estimation.
From practical side, the sum of absolute deviations may not have unique minimum.
In the trivial case of two points, say, (0,5) and (1,9) and constant function y=a, any value of a between 5 and 9 gives the same sum (4). There is no such problem when deviations are squared.
If minimizing absolute deviations would not work, you may consider heuristic procedure to identify and remove outliers. Such as RANSAC or ROUT.
Given data values of some real-time physical process (e.g. network traffic) it is to find a name of the function which "at best" matches with the data.
I have a set of functions of type y=f(t) where y and t are real:
funcs = set([cos, tan, exp, log])
and a list of data values:
vals = [59874.141, 192754.791, 342413.392, 1102604.284, 3299017.372]
What is the simplest possible method to find a function from given set which will generate the very similar values?
PS: t is increasing starting from some positive value by almost-equal intervals
Just write the error ( quadratic sum of error at each point for instance ) for each function of the set and choose the function giving the minimum.
But you should still fit each function before choosing
Scipy has functions for fitting data, but they use polynomes or splines. You can use one of Gauß' many discoveries, the method of least squares to fit other functions.
I would try an approach based on fitting too. For each of the four test functions (f1-4 see below), the values of a and b that minimizes the squared error.
f1(t) = a*cos(b*t)
f2(t) = a*tan(b*t)
f3(t) = a*exp(b*t)
f4(t) = a*log(b*t)
After fitting the squared error of the four functions can be used for evaluating the fit goodness (low values means a good fit).
If fitting is not allowed at all, the four functions can be divided into two distinct subgroups, repeating functions (cos and tan) and strict increasing functions (exp and log).
Strict increasing functions can be identified by checking if all the given values are increasing throughout the measuring interval.
In pseudo code an algorithm could be structured like
if(vals are strictly increasing):
% Exp or log
if(increasing more rapidly at the end of the interval):
% exp detected
else:
% log detected
else:
% tan or cos
if(large changes in vals over a short period is detected):
% tan detected
else:
% cos detected
Be aware that this method is not that stable and will be easy to trick into faulty conclusions.
See Curve Fitting