I have a sample of two datasets, as below.
If I were to plot the two, data A would have a much smoother line graph and data B a spikier one. How can I use pandas to determine/differentiate the smoothness of a dataset, e.g. with a calculation on the data that gives it an index I can equate to its smoothness? I looked for a solution and found a suggestion to use the difference of standard deviations, but it was based on R. Any ideas on this? What sort of calculation would give me what I want? Can anyone point me in the right direction?
Standard deviation doesn't necessarily mean smoothness in the sense you seem to mean. A straight-line graph (y=x) A:1 B:2 C:3 D:4 would be smooth in your sense, right? Whereas A:4 B:1 C:3 D:2 would not (it would go up and down / change direction). I think what you are looking for is a change-of-slope calculation (the derivative of the function at different points, or the gradient).
In this case it's actually quite simple. Just calculate the sum of the absolute differences between consecutive points. The one with the greater total is more "spikey".
You can shift the data (pandas.shift), subtract the shift from the original, take the absolute value and then the sum.
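A minimal sketch of that idea; the column names and example values below are made up purely for illustration:

import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],    # hypothetical smoother series
    "B": [1, 5, 2, 6, 1, 7],    # hypothetical spikier series
})

# shift, subtract the shift from the original, take the absolute value, then sum;
# the spikier column gets the larger score
spikiness = (df - df.shift()).abs().sum()
print(spikiness)

(df.diff() is shorthand for df - df.shift(), if you prefer.)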
I have some datasets which plot curves similar to a Gaussian curve, with a large peak at the center and two almost flat parts at the sides. The problem is that some of these datasets contain experimental errors, which result in step-like formations at the sides. This is a problem because I have to calculate the incremental ratio at each step in the dataset to find the maximum slope, and those steps give me an extremely high derivative, which in some cases is higher than the right peak.
Here is one of the problematic plots:
link
The blue line is the dataset, the red one the derivative, and the green dots are the peaks (for now just calculated as max(derivative) and min(derivative)).
How can I filter out those peaks? I heard that something called renormalization could help with this, but I don't know what that means (I'm just a student, and I'm doing this for my mom :)).
If your dataset is large enough, a simple median filter should do the job. Consider your dataset points ordered by the x value, and replace each value in the dataset with the median of N consecutive points. If the errors are usually single points, even a filter with N=3 will work.
Median filter (Wiki)
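A minimal sketch of that idea, assuming the dataset is already a 1-D NumPy array y ordered by x; the example curve and the injected errors are made up for illustration:

import numpy as np
from scipy.signal import medfilt

x = np.linspace(-5, 5, 101)
y = np.exp(-x**2)          # a smooth, Gaussian-like curve
y[10] += 0.5               # injected single-point errors
y[80] -= 0.5

# replace each value with the median of N=3 consecutive points;
# isolated single-point errors disappear, the overall shape is kept
y_filtered = medfilt(y, kernel_size=3)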
I have a dataframe which includes heights. The data cannot go below zero. That's why I cannot use the standard deviation, as this data is not normally distributed. I cannot use the 68-95-99.7 rule here because it fails in my case. Here is my dataframe, with its mean and SD.
0.77132064
0.02075195
0.63364823
0.74880388
0.49850701
0.22479665
0.19806286
0.76053071
0.16911084
0.08833981
Mean: 0.41138725956196015
Std: 0.2860541519582141
If I go 2 SD below the mean, as you can see, the number becomes negative:
mean - 2 x std = 0.41138725956196015 - 2 x 0.2860541519582141 = -0.160721044354468
I have tried using percentiles and am not satisfied with them, to be honest. How can I apply Chebyshev's inequality to this problem? Here is what I did so far:
np.polynomial.Chebyshev(df['Heights'])
But this returns numbers, not an SD level I can measure against. Or do you think Chebyshev is the best choice in my case?
Expected solution:
I am expecting to get a range, e.g. "with 75% probability, the next height will be between 0.40 and 0.43", etc.
EDIT1: Added histogram
To be more clear, I have added my real data's histogram
EDIT2: Some values from real data
Mean: 0.007041500928135767
Percentile 50: 0.0052000000000000934
Percentile 90: 0.015500000000000047
Std: 0.0063790857035425025
Var: 4.06873389299246e-05
Thanks a lot
You seem to be confusing two ideas from the same mathematician, Chebyshev. These ideas are not the same.
Chebyshev's inequality states a fact that is true for any probability distribution with a finite standard deviation. For two standard deviations, it states that at least three-fourths of the data items will lie within two standard deviations of the mean. As you state, for normal distributions about 19/20 of the items will lie in that interval, but Chebyshev's inequality is an absolute bound that is met by practically all distributions. The fact that your data values are never negative does not change the truth of the inequality; it just makes the actual proportion of values in the interval even larger, so the inequality holds with even more room to spare.
Chebyshev polynomials do not involve statistics; they are simply a series (or two series) of polynomials commonly used for calculating approximations of functions in numerical computing. That is what np.polynomial.Chebyshev involves, and it therefore does not seem useful to you here.
So calculate Chebyshev's inequality yourself. There is no need for a special function for that, since it is so easy (this is Python 3 code):
def Chebyshev_inequality(num_std_deviations):
    # lower bound on the fraction of values within that many SDs of the mean
    return 1 - 1 / num_std_deviations**2
You can change that to handle the case where k <= 1 but the idea is obvious.
In your particular case: the inequality says that at least 3/4, or 75%, of the data items will lie within 2 standard deviations of the mean, which means more than 0.41138725956196015 - 2 * 0.2860541519582141 and less than 0.41138725956196015 + 2 * 0.2860541519582141 (note the different signs), which simplifies to the interval
[-0.16072104435446805, 0.9834955634783884]
In your data, 100% of your data values are in that interval, so Chebyshev's inequality was correct (of course).
Now, if your goal is to predict or estimate where a certain percentile is, Chebyshev's inequality does not help much. It is an absolute lower bound, so it gives one limit to a percentile. For example, by what we did above we know that the 12.5th percentile is at or above -0.16072104435446805 and the 87.5th percentile is at or below 0.9834955634783884. Those facts are true but are probably not what you want. If you want an estimate that is closer to the actual percentile, this is not the way to go. The 68-95-99.7 rule is an estimate--the actual locations may be higher or lower, but if the distribution is normal then the estimate will not be far off. Chebyshev's inequality does not do that kind of estimate.
If you want to estimate the 12.5th and 87.5th percentiles (showing where 75 percent of all the population will fall), you should calculate those percentiles of your sample and use those values. If you don't know more details about the kind of distribution you have, I don't see any better way. There are reasons why normal distributions are so popular!
It sounds like you want the boundaries for the middle 75% of your data.
The middle 75% of the data is between the 12.5th percentile and the 87.5th percentile, so you can use the quantile function to get the values at the locations:
[df['Heights'].quantile(0.5 - 0.75/2), df['Heights'].quantile(0.5 + 0.75/2)]
#[0.09843618875, 0.75906485625]
As per What does it mean when the standard deviation is higher than the mean? What does that tell you about the data? - Quora, SD is a measure of "spread" and mean is a measure of "position". As you can see, these are more or less independent things. Now, if all your samples are positive, the SD is usually not much greater than the mean, because of the way it's calculated, but 2 or 3 SDs very well can be.
So, basically, SD being roughly equal to the mean means that your data are all over the place.
Now, a random variable that's strictly positive indeed cannot be normally distributed. But for a rough estimation, seeing that you still have a bell shape, we can pretend it is and still use the SD as a rough measure of the spread (though, since mean minus 2 or 3 SD goes negative here, those multiples lack any physical meaning whatsoever and are unusable for the sake of our pretense):
E.g. to get a rough prediction of grass growth, you can still take the mean and apply whatever growth model you're using to it -- that will get the new, prospective mean. Then applying the same to mean ± SD will give an idea of the new SD.
This is very rough, of course. But to get any better, you'll have to somehow check which distribution you're dealing with and use its peak and spread characteristics instead of mean and SD. And in any case, your prediction will not be any better than your growth model -- studies of which are anything but conclusive judging by e.g. https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1365-3040.2005.01490.x (not a single formula there).
I'm a PhD student in sociology working on my dissertation. In the course of some data analysis, I have bumped up against the following problem.
I have a table of measured values of some variable over a series of years. The values count how many events of a certain type occurred in a given year. Here is a sample of what it looks like:
year var
1983 22
1984 55
1985 34
1986 29
1987 15
1988 20
1989 41
So, e.g. in 1984, 55 such events occurred over the whole year.
One way to represent this data over the domain of real numbers in [1983, 1990) is with a piecewise function f:
f(x) = var if floor(x) == year, for all x in [1983, 1990).
This function plots a series of horizontal lines of width 1, mapping a bar chart of the variable. The area under each of these lines is equal to the variable's value in that year. However, for this variable, I know that in each year, the rate is not constant over the whole year. In other words, the events don't suddenly jump from one yearly rate to another rate overnight on Dec 31, as the (discontinuous) function f seems to present. I don't know exactly how the rate changes, but I'd like to assume a smooth transition from year to year.
So, what I want is a function g which is both continuous and smooth (continuously differentiable) over the domain [1983, 1990), which also preserves the yearly totals. That is, the definite integral of g from 1984 to 1985 must still equal 55, and same for all other years. (So, for example, an n-degree polynomial which hits all the midpoints of the bars will NOT work.) Also, I'd like g to be a piecewise function, with all the pieces relatively simple -- quadratics would be best, or a sinusoid.
In sum: I want g to be a series of parabolas defined over each year, which smoothly transition from one to the other (left and right limits of g'(x) should be equal at the year boundaries), and where the area under each parabola is equal to the totals given by my data above.
I've drawn a crude version of what I want here. The cartoon uses the same data as above, with the black curve representing my hoped-for function, g. Toward the right end things got particularly bad, esp 1988 and 1989. But it's just meant to show a picture of what I would like to end up with.
Thanks for your help, or for pointing me towards other resources you think might be helpful!
PS I have looked at this paper which is linked inside this question. I agree with the authors (see section 4) that if I could replace my data with pseudodata d' using matrix A, from which I could very simply generate some sort of smooth function, that would be great, but they do not say how A could be obtained. Just some food for thought. Thanks again!
PPS What I need is a reliable method of generating g, given ANY data table as above. I actually have hundreds of these kinds of yearly count data, so I need a general solution.
You need the integral of your curve to go through a specific set of points, defined by the cumulative totals, so...
1. Interpolate between the cumulative totals to get an integral curve, and then
2. take the derivative of that to get the function you're looking for.
Since you want your function to be "continuous and smooth", i.e., C1-continuous, the integral curve you interpolate needs to be C2-continuous, i.e., it has to have continuous first and second derivatives. You can use polynomial interpolation, sinc interpolation, splines of sufficient degree, etc.
Using "natural" cubic splines to interpolate the integral will give you a piece-wise quadratic derivative that seems to satisfy all your requirements.
There's a pretty good description of the natural cubic splines here: http://mathworld.wolfram.com/CubicSpline.html
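A minimal sketch of this approach with SciPy's CubicSpline and a natural boundary condition; the variable names are mine, and the data is the table from the question:

import numpy as np
from scipy.interpolate import CubicSpline

years = np.array([1983, 1984, 1985, 1986, 1987, 1988, 1989])
counts = np.array([22, 55, 34, 29, 15, 20, 41])

# cumulative totals at the year boundaries: F(1983) = 0, F(1984) = 22, ...
knots = np.append(years, years[-1] + 1)
cumulative = np.concatenate(([0], np.cumsum(counts)))

# natural cubic spline through the cumulative totals (C2-continuous),
# so its derivative g is a C1-continuous piecewise quadratic
F = CubicSpline(knots, cumulative, bc_type='natural')
g = F.derivative()

# the integral of g over each year reproduces the original counts exactly
print([float(F(y + 1) - F(y)) for y in years])

One caveat: g is not guaranteed to stay non-negative everywhere, which may matter for count data.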
If your goal is to transform discrete data into a continuous representation, I would recommend looking up Kernel Density Estimation. KDE essentially models each data point as a (usually Gaussian) distribution and sums them up, resulting in a smooth, continuous distribution. This blog does a very thorough treatment of KDE using the SciPy module.
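A minimal sketch of a weighted KDE on the yearly counts above, assuming a SciPy version (1.2 or later) whose gaussian_kde accepts a weights argument; the bandwidth is an arbitrary choice, and unlike the spline approach this does not reproduce the yearly totals exactly:

import numpy as np
from scipy.stats import gaussian_kde

years = np.array([1983, 1984, 1985, 1986, 1987, 1988, 1989])
counts = np.array([22, 55, 34, 29, 15, 20, 41])

# one Gaussian kernel per year midpoint, weighted by that year's count
kde = gaussian_kde(years + 0.5, weights=counts, bw_method=0.3)

# the KDE integrates to 1, so scale by the total count to get an event rate
xs = np.linspace(1983, 1990, 200)
rate = kde(xs) * counts.sum()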
One of the downsides of KDE is that it does not provide an analytic solution. If that is your goal, I would recommend looking up polynomial regression.
I have a set of measured radii (t + epsilon + error) at equally spaced angles.
The model is a circle of radius R with its center at (r, Alpha), with small added noise and some random error values which are much bigger than the noise.
The problem is to find the center of the circle model (r, Alpha) and the radius of the circle (R), but the fit should not be too sensitive to the random errors (in the data below, points 7 and 14).
Some radii could be missing, therefore a simple mean would not work here.
I tried least-squares optimization but it reacts strongly to the errors.
Is there a way to optimize least deltas but not the least squares of delta in Python?
Model:
n=36
R=100
r=10
Alpha=2*Pi/6
Data points:
[95.85, 92.66, 94.14, 90.56, 88.08, 87.63, 88.12, 152.92, 90.75, 90.73, 93.93, 92.66, 92.67, 97.24, 65.40, 97.67, 103.66, 104.43, 105.25, 106.17, 105.01, 108.52, 109.33, 108.17, 107.10, 106.93, 111.25, 109.99, 107.23, 107.18, 108.30, 101.81, 99.47, 97.97, 96.05, 95.29]
It seems like your main problem here is going to be removing outliers. There are a couple of ways to do this, but for your application your best bet is probably just to remove items based on their distance from the median (since the median is much less sensitive to outliers than the mean).
If you're using numpy, that would look like this:
import numpy as np

def remove_outliers(data_points, margin=1.5):
    # data_points should be a numpy array; nd is each point's distance from
    # the median, and s scales it by the median of those distances
    nd = np.abs(data_points - np.median(data_points))
    s = nd / np.median(nd)
    return data_points[s < margin]    # keep only the points within the margin
After which you should run least squares.
If you're not using numpy you can do something similar with native python lists:
def median(points):
    return sorted(points)[len(points) // 2]    # integer index, works on Python 2 and 3

def remove_outliers(data_points, margin=1.5):
    m = median(data_points)
    centered_points = [abs(point - m) for point in data_points]
    centered_median = median(centered_points)
    ratios = [datum / centered_median for datum in centered_points]
    # keep the points within the margin, i.e. drop the outliers
    return [point for i, point in enumerate(data_points) if ratios[i] < margin]
If you're looking to simply not weight outliers as highly, you can just calculate the mean of your dataset, which is the linear equivalent of the least-squares optimization.
If you're looking for something a little better, I might suggest running your data through some kind of low-pass filter, but I don't think that's really needed here.
A low-pass filter would probably be the best of these options, which you can do as follows (note: alpha is a number you will have to tune to get your desired output):
def low_pass(data, alpha):
    # simple exponential smoothing: blend each sample with the previous smoothed value
    new_data = [data[0]]
    for i in range(1, len(data)):
        new_data.append(alpha * data[i] + (1 - alpha) * new_data[i - 1])
    return new_data
At which point your least squares optimization should work fine.
Replying to your final question
Is there a way to optimize least deltas but not the least squares of delta in Python?
Yes, pick an optimization method (for example the downhill simplex implemented in scipy.optimize.fmin) and use the sum of absolute deviations as a merit function. Your dataset is small, so I suppose any general-purpose optimization method will converge quickly. (In the case of non-linear least-squares fitting it is also possible to use a general-purpose optimization algorithm, but it's more common to use the Levenberg-Marquardt algorithm, which minimizes sums of squares.)
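A minimal sketch of that suggestion on the data from the question. The geometric model below (the distance from the origin to a circle of radius R whose centre sits at polar coordinates (r, alpha)) and the starting values are my assumptions about the setup:

import numpy as np
from scipy.optimize import fmin

# the 36 measured radii from the question, at equally spaced angles
data = np.array([
    95.85, 92.66, 94.14, 90.56, 88.08, 87.63, 88.12, 152.92, 90.75,
    90.73, 93.93, 92.66, 92.67, 97.24, 65.40, 97.67, 103.66, 104.43,
    105.25, 106.17, 105.01, 108.52, 109.33, 108.17, 107.10, 106.93,
    111.25, 109.99, 107.23, 107.18, 108.30, 101.81, 99.47, 97.97,
    96.05, 95.29])
theta = np.linspace(0, 2 * np.pi, len(data), endpoint=False)

def model(params, theta):
    # distance from the origin to a circle of radius R centred at (r, alpha)
    R, r, alpha = params
    return r * np.cos(theta - alpha) + np.sqrt(R**2 - (r * np.sin(theta - alpha))**2)

def merit(params):
    # sum of absolute deviations instead of sum of squares
    return np.sum(np.abs(model(params, theta) - data))

R_fit, r_fit, alpha_fit = fmin(merit, x0=[100.0, 5.0, 0.0])

The outliers at points 7 and 14 should pull this fit much less than they would a least-squares fit.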
If you are interested in when minimizing absolute deviations instead of squares has a theoretical justification, see Numerical Recipes, chapter on Robust Estimation.
From the practical side, the sum of absolute deviations may not have a unique minimum.
In the trivial case of two points, say (0,5) and (1,9), and a constant function y=a, any value of a between 5 and 9 gives the same sum (4). There is no such problem when the deviations are squared.
If minimizing absolute deviations does not work, you may consider a heuristic procedure to identify and remove outliers, such as RANSAC or ROUT.
I have been trying to calculate an autocorrelation function, as defined in statistical mechanics, using numpy. Most of the documentation I found relates to functions like correlate and convolve. However, for a given random variable x these functions just seem to calculate the sum
ACF(dt) = sum_{t=0}^{T} [x(t)*x(t+dt)]
instead of the average
ACF(dt) = mean[x(t)*x(t+dt)]
so in fact for calculating an autocorrelation function one would need to do something like:
acf = np.correlate(x, x, mode='full')
acf_half = acf[acf.size // 2:]          # keep lags 0 .. N-1
ldata = len(acf_half)
acf = np.array([v / (ldata - i) for i, v in enumerate(acf_half)])   # divide by the overlap length
Of course we would need to subtract mean(x)**2 from the resulting acf to be correct.
Can anyone confirm that this is correct?
Generally speaking, the autocorrelation, correlation, etc. is the sum (integral). Sometimes it is normalized, but not averaged in the sense as you've written above. This is because they are defined in terms of the mathematical convolution operation, which is simply the integral that you've written as a sum above.
The brackets at the stat mech page indicate a thermal average, which is an ensemble or time average over the 'experiment' taking place many times at many different states at some temperature. This (the finite temperature) causes the fluctuations that give rise to the 'statistical' nature of the problem, and cause the decay of the correlation (loss of long range order). This simply means that you should find the autocorrelation of several datasets, and average those together, but do not take the mean of the function.
As far as I can tell, your code is attempting to weight the correlation at each lag dt by the length of the overlap at that lag, but I do not believe that this is correct.
With respect to the subtraction of <s>^2: that's from the spin model, where <s> would be the mean spin (magnetization), so I believe you are correct that you should use mean(x)**2.
As a side-note, I would suggest using mode='same' instead of 'full' so that the domain of your correlation matches the domain of your input without having to look at just one-half of the output (here the output is symmetric, so it doesn't really make a difference).
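A minimal sketch of the ensemble-average approach described above; the random test data and its shape are made up for illustration:

import numpy as np

# several realisations of x(t), one per row
rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 256))

# autocorrelation of each realisation (mode='same'), then the ensemble average
acf = np.mean([np.correlate(x, x, mode='same') for x in samples], axis=0)

# subtract <x>**2, as discussed above for the spin model
acf -= np.mean(samples) ** 2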