What is an efficient method for determining the skew/kurtosis of a bar graph in Python? Since bar graphs are not binned (unlike histograms), this question may not make a lot of sense at first, but what I am trying to do is determine the symmetry of a graph's height vs. distance (rather than frequency vs. bins). In other words, given heights (y) measured along distance (x), i.e.
y = [6.18, 10.23, 33.15, 55.25, 84.19, 91.09, 106.6, 105.63, 114.26, 134.24, 137.44, 144.61, 143.14, 150.73, 156.44, 155.71, 145.88, 120.77, 99.81, 85.81, 55.81, 49.81, 37.81, 25.81, 5.81]
x = [0.03, 0.08, 0.14, 0.2, 0.25, 0.31, 0.36, 0.42, 0.48, 0.53, 0.59, 0.64, 0.7, 0.76, 0.81, 0.87, 0.92, 0.98, 1.04, 1.09, 1.15, 1.2, 1.26, 1.32, 1.37]
What is the symmetry (skewness) and peakedness (kurtosis) of that height (y) distribution as measured over distance (x)? Are skewness/kurtosis appropriate measures for judging how close such real-valued data are to a normal distribution? Or does scipy/numpy offer something better suited to this type of measurement?
I can get a skew/kurtosis estimate by treating the heights (y) as frequencies of the x values binned along distance, as follows:
from itertools import chain
from matplotlib.pyplot import hist, xlabel, ylabel
from scipy import stats

# expand each x value into int(round(y)) copies so the heights act as frequencies
freq = list(chain(*[[x_v] * int(round(y_v)) for x_v, y_v in zip(x, y)]))
x.append(x[-1] + x[0])  # add one extra bin edge
hist(freq, bins=x)
ylabel("Height Frequency")
xlabel("Distance(km) Bins")
print("Skewness,", "Kurtosis:", stats.describe(freq)[4:])
Skewness, Kurtosis: (-0.019354300509997705, -0.7447085398785758)
In this case the height distribution is roughly symmetrical (skew ≈ -0.02) around the midpoint distance and platykurtic (kurtosis ≈ -0.74, i.e. broad).
Since I replicate each x value in proportion to its height y to build the frequency list, the resulting list can get very large. Is there a better way to approach this problem? I suppose I could always normalize dataset y to a range of, say, 0-100 without losing too much information about the dataset's skew/kurtosis.
This isn't a Python question, nor is it really a programming question, but the answer is simple nonetheless. Instead of skew and kurtosis, let's first consider the easier quantities based on the lower moments: the mean and standard deviation. To make it concrete, and to fit with your question, let's assume your data look like:
X = 3, 3, 5, 5, 5, 7 = x1, x2, x3 ....
Which would give a "bar graph" that looks like:
{3:2, 5:3, 7:1} = {k1:p1, k2:p2, k3:p3}
The mean, u, is given by
E[X] = (1/N) * (x1 + x2 + x3 + ...) = (1/N) * (3 + 3 + 5 + ...)
Our data, however, has repeated values, so this can be rewritten as
E[X] = (1/N) * (p1*k1 + p2*k2 + ...) = (1/N) * (3*2 + 5*3 + 7*1)
The next term, the standard deviation s, is simply
sqrt(E[(X-u)^2]) = sqrt((1/N)*( (x1-u)^2 + (x2-u)^2 + ...))
But we can apply the same reduction to the E[(X-u)^2] term and write it as
E[(X-u)^2] = (1/N)*( p1*(k1-u)^2 + p2*(k2-u)^2 + ... )
= (1/6)*( 2*(3-u)^2 + 3*(5-u)^2 + 1*(7-u)^2 )
This means we don't need to keep multiple copies of each data item to do the sum, as you did in your question.
The skew and kurtosis are quite simple at this point:
skew = E[(X-u)^3] / (E[(X-u)^2])^(3/2)
kurtosis = ( E[(X-u)^4] / (E[(X-u)^2])^2 ) - 3
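A minimal sketch of this in Python (the helper function and names are mine): using the x values as k and the rounded y heights as the weights p should reproduce the stats.describe numbers above without ever building the expanded frequency list.
import numpy as np

def weighted_moments(k, p):
    """Mean, standard deviation, skewness and excess kurtosis of values k with weights p."""
    k = np.asarray(k, dtype=float)
    p = np.asarray(p, dtype=float)
    n = p.sum()
    u = (p * k).sum() / n              # E[X]
    m2 = (p * (k - u) ** 2).sum() / n  # E[(X-u)^2]
    m3 = (p * (k - u) ** 3).sum() / n  # E[(X-u)^3]
    m4 = (p * (k - u) ** 4).sum() / n  # E[(X-u)^4]
    return u, np.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

# e.g. with the x and y lists from the question (before the extra bin edge is appended):
print(weighted_moments(x, [int(round(y_v)) for y_v in y])[2:])  # (skewness, kurtosis)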
Related
I have a set of x and y data and I want to use exponential regression to find the curve that best fits that set of points, i.e.:
y = P1 + P2 exp(-P0 x)
I want to calculate the values of P0, P1 and P2.
I use a piece of software called "Igor Pro" that calculates the values for me, but I want a Python implementation. I used the curve_fit function, but the values I get are nowhere near the ones calculated by Igor. Here are the data sets I have:
Set1:
x = [ 1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91 ]
y = [ 476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5 ]
Values calculated by Igor:
P1=376.91, P2=5393.9, P0=3.7776
Values calculated by curve_fit:
P1=702.45, P2=-13.33, P0=-2.6744
Set2:
x = [ 1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58]
y = [ 648, 618, 636, 485, 384, 639, 630, 583, 529]
Values calculated by Igor:
P1=321, P2=4848, P0=-1.94
Values calculated by curve_fit:
No optimal values found
I use curve_fit as follows:
import numpy as np
from scipy.optimize import curve_fit
popt, pcov = curve_fit(lambda t, a, b, c: a * np.exp(-b * t) + c, x, y)
where:
P1=c, P2=a and P0=b
Well, when comparing fit results, it is always important to include the uncertainties in the fitted parameters. That is, when you say that the values from Igor (P1=376.91, P2=5393.9, P0=3.7776) and from curve_fit (P1=702.45, P2=-13.33, P0=-2.6744) are different, what is it that leads you to conclude those values are actually different?
Of course, in everyday conversation, 376.91 and 702.45 are very different, mostly because simply stating a value to 2 decimal places implies accuracy at approximately that scale (the distance between New York and Tokyo is 10,850 km but is not really 1,084,702,431 cm -- that might be the distance between particular bus stops in the two cities). But when comparing fit results, that everyday knowledge cannot be assumed, and you have to include uncertainties. I don't know whether Igor will give you those. scipy's curve_fit can, but it requires some work to extract them -- a pity.
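For what it's worth, here is a minimal sketch of how those uncertainties can be pulled out of curve_fit's covariance matrix (assuming the popt and pcov from the call shown in the question, where the parameter order is a, b, c):
import numpy as np
perr = np.sqrt(np.diag(pcov))  # 1-sigma standard errors of the fitted parameters
for name, val, err in zip(("P2 (a)", "P0 (b)", "P1 (c)"), popt, perr):
    print("%s = %g +/- %g" % (name, val, err))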
Allow me to recommend trying lmfit (disclaimer: I am an author). With that, you would set up and execute the fit like this:
import numpy as np
from lmfit import Model
x = [ 1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91 ]
y = [ 476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5 ]
# x = [ 1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58]
# y = [ 648, 618, 636, 485, 384, 639, 630, 583, 529]
# Define the function that we want to fit to the data
def func(x, offset, scale, decay):
    return offset + scale * np.exp(-decay * x)
model = Model(func)
params = model.make_params(offset=375, scale=5000, decay=4)
result = model.fit(y, params, x=x)
print(result.fit_report())
This would print out the result of
[[Model]]
Model(func)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 49
# data points = 9
# variables = 3
chi-square = 72.2604167
reduced chi-square = 12.0434028
Akaike info crit = 24.7474672
Bayesian info crit = 25.3391410
R-squared = 0.99362489
[[Variables]]
offset: 413.168769 +/- 17348030.9 (4198775.95%) (init = 375)
scale: 16689.6793 +/- 1.3337e+10 (79909638.11%) (init = 5000)
decay: 5.27555726 +/- 1016721.11 (19272297.84%) (init = 4)
[[Correlations]] (unreported correlations are < 0.100)
C(scale, decay) = 1.000
C(offset, decay) = 1.000
C(offset, scale) = 1.000
indicating that the uncertainties in the parameter values are simply enormous and the correlations between all parameters are 1. This is because you have only 2 distinct x values, which makes it impossible to accurately determine 3 independent parameters.
And note that, with an uncertainty of 17 million, the values for P1 (offset) of 413 and 702 do actually agree. The problem is not that Igor and curve_fit disagree on the best value; it is that neither can determine the value with any accuracy at all.
For your other dataset, the situation is a little better, with a result:
[[Model]]
Model(func)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 82
# data points = 9
# variables = 3
chi-square = 1118.19957
reduced chi-square = 186.366596
Akaike info crit = 49.4002551
Bayesian info crit = 49.9919289
R-squared = 0.98272310
[[Variables]]
offset: 320.876843 +/- 42.0154403 (13.09%) (init = 375)
scale: 4797.14487 +/- 2667.40083 (55.60%) (init = 5000)
decay: 1.93560164 +/- 0.47764470 (24.68%) (init = 4)
[[Correlations]] (unreported correlations are < 0.100)
C(scale, decay) = 0.995
C(offset, decay) = 0.940
C(offset, scale) = 0.904
The correlations are still high, but the parameters are reasonably well determined. Also note that the best-fit values here are much closer to those you got from Igor, and probably "within the uncertainty".
And this is why one always needs to include uncertainties with the best-fit values reported from a fit.
Set 1 :
x = [ 1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91 ]
y = [ 476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5 ]
One observes that there are only two distinct values of x: 1.06 and 0.91.
On the other hand there are three parameters to optimise: P0, P1 and P2. That is too many.
In other words, an infinity of exponential curves can be found to fit the two clusters of points. The differences between the fitted curves can be due to slight differences in the non-linear regression implementations, especially in how the initial values of the iterative process are chosen.
In this particular case a simple linear regression would be free of this ambiguity.
By comparison, plotting the data together with both fitted curves shows that Igor and curve_fit both give an excellent fit: the points lie very close to both curves. One understands that infinitely many other exponential functions would fit just as well.
Set 2 :
x = [ 1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58]
y = [ 648, 618, 636, 485, 384, 639, 630, 583, 529]
The difficulty you are meeting might be due to the choice of the "guessed" initial values of the parameters, which are required to start the iterative process of non-linear regression.
In order to check this hypothesis, one can use a different method which doesn't need initial guessed values. The MathCad code and the numerical results are shown below.
Don't be surprised if the parameter values you get with your software are slightly different from the above values (a, b, c). The fitting criterion implicitly used by your software is probably different from the one used in mine.
Blue curve: the regression method is a least-mean-square-error fit with respect to a linear integral equation of which the exponential equation is a solution. Ref.: https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
This non-standard method isn't iterative and doesn't require initial "guessed" values for the parameters.
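As a quick, illustrative check of the initial-value hypothesis (this is my own sketch, not the integral-equation method above): giving curve_fit explicit starting values for Set 2, instead of its default of all ones, should let it converge close to the values quoted earlier (P2 ≈ 4800, P0 ≈ 1.94, P1 ≈ 321).
import numpy as np
from scipy.optimize import curve_fit

x = np.array([1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58])
y = np.array([648, 618, 636, 485, 384, 639, 630, 583, 529])

def func(t, a, b, c):  # y = c + a*exp(-b*t), i.e. P1=c, P2=a, P0=b
    return a * np.exp(-b * t) + c

# rough starting values (read off the data / taken from the other answer) instead of the default (1, 1, 1)
popt, pcov = curve_fit(func, x, y, p0=(5000, 2, 300), maxfev=10000)
print(popt)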
I have a dataset of pictures as tensors with each pixel having a value between 0 and 1, and I have a set of "bins."
bins = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
I want to return a tensor with each pixel value being its nearest bin. As in, if a pixel is 0.03 it will turn into 0.05, if a pixel is 0.79 it will turn into 0.75.
I want this to be done with tensors not numpy.
Here it is working in numpy... TensorFlow, however, seems to be a whole different beast when it comes to iterating. I have tried tf.map_fn and tf.scan to iterate through, but I couldn't get them to work.
def valueQuant(picture, splitSize):
    #This is the Picture that will be returned
    Quant_Pic = np.zeros((picture.shape[0], picture.shape[1]))
    #go through each pixel of the image
    for y_col in range(picture.shape[0]):
        for x_row in range(picture.shape[1]):
            #isolate regions based on value
            for i in range(splitSize):
                #low and high values to isolate
                lowFloatRange = float((1/splitSize)*i)
                highFloatRange = float((1/splitSize)*(i+1))
                #value to turn the entire cluster into
                midRange = lowFloatRange + ((highFloatRange - lowFloatRange)/2)
                #current value of current pixel
                curVal = picture[y_col][x_row]
                #if the current value is within the range of interest
                if(curVal >= lowFloatRange and curVal <= highFloatRange):
                    Quant_Pic[y_col][x_row] = midRange
    return Quant_Pic
I was able to figure out an element-wise method using only TensorFlow ops.
def quant_val(current_input):
    bins = tf.constant([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])
    dist = tf.tile(current_input, [10])
    dist = tf.math.subtract(bins, current_input)
    absDist = tf.math.abs(dist)
    idx = tf.math.argmin(absDist)
    output = bins[idx]
    output = tf.expand_dims(output, 0)
    print("output", output)
    return output

current_input = tf.constant([0.53])
quant_val(current_input)
This is able to return the right answer for a tensor with a single value, but I am unsure how to extrapolate this to the larger image tensor structure. Any help would be much appreciated!!! Thank you oh kind wise ones.
Round approach:
This is very simple and easy, but some .5 values are rounded up and others down. If this is not a problem:
def quant_val(images): #0 to 1
    images = (images - 0.05) * 10 #-0.5 to 9.5
    bins = tf.round(images) #0 to 9
    bins = tf.clip_by_value(bins, 0, 9) #possible -1 and 10 due to the remark on top
    return (bins/10) + 0.05 #0.05 to 0.95
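A quick sanity check of the round approach (the example values are mine; assumes TensorFlow 2.x with eager execution):
import tensorflow as tf

pixels = tf.constant([[0.03, 0.79], [0.53, 1.00]])
print(quant_val(pixels))  # expected nearest bins: 0.05, 0.75, 0.55, 0.95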
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
observed = [0.294, 0.2955, 0.235, 0.2536, 0.2423, 0.2844, 0.2099, 0.2355, 0.2946, 0.3388, 0.2202, 0.2523, 0.2209, 0.2707, 0.1885, 0.2414, 0.2846, 0.328, 0.2265, 0.2563, 0.2345, 0.2845, 0.1787, 0.2392, 0.2777, 0.3076, 0.2108, 0.2477, 0.234, 0.2696, 0.1839, 0.2344, 0.2872, 0.3224, 0.2152, 0.2593, 0.2295, 0.2702, 0.1876, 0.2331, 0.2809, 0.3316, 0.2099, 0.2814, 0.2174, 0.2516, 0.2029, 0.2282, 0.2697, 0.3424, 0.2259, 0.2626, 0.2187, 0.2502, 0.2161, 0.2194, 0.2628, 0.3296, 0.2323, 0.2557, 0.2215, 0.2383, 0.2166, 0.2315, 0.2757, 0.3163, 0.2311, 0.2479, 0.2199, 0.2418, 0.1938, 0.2394, 0.2718, 0.3297, 0.2346, 0.2523, 0.2262, 0.2481, 0.2118, 0.241, 0.271, 0.3525, 0.2323, 0.2513, 0.2313, 0.2476, 0.232, 0.2295, 0.2645, 0.3386, 0.2334, 0.2631, 0.226, 0.2603, 0.2334, 0.2375, 0.2744, 0.3491, 0.2052, 0.2473, 0.228, 0.2448, 0.2189, 0.2149]
a, b, loc, scale = stats.beta.fit(observed,floc=0,fscale=1)
ax = plt.subplot(111)
ax.hist(observed, alpha=0.75, color='green', bins=104, density=True)
ax.plot(np.linspace(0, 1, 100), stats.beta.pdf(np.linspace(0, 1, 100), a, b))
plt.show()
The fitted α and β are out of whack (α=6.056697373013153, β=409078.57804704335).
The fitted curve is also unreasonable: the histogram and the beta PDF differ wildly in height on the y-axis.
The average of the data is about 0.25, but the expected value computed from the fitted beta distribution is 6.05/(6.05+409078.57) = 1.47891162469e-05. This seems counterintuitive.
I think you are mixing up the code a bit with whatever your observation is.
The main point to consider is that your beta fit will have both a and b, as well as loc and scale.
If you perform your fit using fixed loc/scale, i.e. scipy.stats.beta.fit(observed, floc=0, fscale=1), then your fitted a and b are: a = 33.26401059422594 and b = 99.0180817184922.
On the other hand, if you perform your fit with variable loc and scale, i.e. scipy.stats.beta.fit(observed), then you must compute / consider scipy.stats.beta.pdf() to include also those as parameter, which are, with your data, a = 6.056697380819225, b = 409078.5780469263, loc = 0.15710752697400227, scale = 6373.831662619217.
According to its documentation, the probability density above is defined in the “standardized” form. To shift and/or scale the distribution use the loc and scale parameters. Specifically, beta.pdf(x, a, b, loc, scale) is identically equivalent to beta.pdf(y, a, b) / scale with y = (x - loc) / scale.
Hence, the theoretical mean/average should be modified accordingly to include the scaling and location transformations.
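As a concrete check, here is a small sketch using the fitted numbers above: once loc and scale are included, the theoretical mean comes out close to the 0.25 sample average.
from scipy import stats

a, b = 6.056697380819225, 409078.5780469263
loc, scale = 0.15710752697400227, 6373.831662619217
print(loc + scale * a / (a + b))                    # shift/scale applied to the standardized beta mean, ~0.25
print(stats.beta.mean(a, b, loc=loc, scale=scale))  # the same value straight from scipy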
I am aiming to show the accuracy of a numerical solution and how this varies with the value of timestep chosen. The numerical solution is produced using the following code:
def f(te3):
    y3 = -r3*(te3 - te_surr)  # y is the derivative
    return y3

for i in range(1, len(t3)):
    te3[i] = te3[i-1] + f(te3[i-1])*dt
These numerical solutions are then plotted:
plt.plot(t3,te3)
Originally, dt was chosen to be 0.1. I am trying to show the plots produced for various timestep values: 0.05, 0.01, etc. However, I am unsure how to implement this in my code other than manually typing out each value of dt,
dt2 = 0.05
dt3 = 0.025
dt4 = 0.01
dt5 = 0.005
dt6 = 0.001
then changing the code shown above for each dt value and so forth. Is there a way I can store these values as a list or an array and use this to plot the values?
Maybe you can use a dictionary to hold a te3 solution for each dt, keyed by the timestep. For example:
dt_values = [0.1, 0.05, 0.025, 0.01, 0.005, 0.001]
my_te3 = {dt: [te3[0]] for dt in dt_values}  # one solution list per timestep, all starting from the same initial value
for dt in dt_values:
    n_steps = int(round(t3[-1] / dt))        # cover the same time span for every dt
    for i in range(1, n_steps + 1):
        my_te3[dt].append(my_te3[dt][i-1] + f(my_te3[dt][i-1])*dt)
Then, to plot, loop over the dictionary and rebuild the matching time axis for each run:
for dt, te in my_te3.items():
    t = [i*dt for i in range(len(te))]
    plt.plot(t, te, label="dt = %g" % dt)
plt.legend()
Note that .items() works in both Python 2 and 3 (the .itervalues() variant is Python 2 only, and would also lose the dt needed to build the time axis).
I am working on moving some code from IDL into python. One IDL call is to INT_TABULATE which performs integration on a fixed range.
The INT_TABULATED function integrates a tabulated set of data { xi , fi } on the closed interval [MIN(x) , MAX(x)], using a five-point Newton-Cotes integration formula.
Result = INT_TABULATED( X, F [, /DOUBLE] [, /SORT] )
Where result is the area under the curve.
IDL DOCS
My question is: does Numpy/SciPy offer a similar form of integration? I see that scipy.integrate.newton_cotes exists, but it appears to return the "weights and error coefficient for Newton-Cotes integration" rather than the area itself.
Scipy does not provide such a high-order integrator for tabulated data out of the box. The closest you have available without coding it yourself is scipy.integrate.simps, which uses a 3-point Newton-Cotes method (Simpson's rule).
If you simply want comparable integration precision, you could split your x and f arrays into 5-point chunks and integrate them one at a time, using the weights returned by scipy.integrate.newton_cotes, along the lines of:
import numpy as np
import scipy.integrate

def idl_tabulate(x, f, p=5):
    def newton_cotes(x, f):
        if x.shape[0] < 2:
            return 0
        rn = (x.shape[0] - 1) * (x - x[0]) / (x[-1] - x[0])
        weights = scipy.integrate.newton_cotes(rn)[0]
        return (x[-1] - x[0]) / (x.shape[0] - 1) * np.dot(weights, f)
    ret = 0
    for idx in range(0, x.shape[0], p - 1):
        ret += newton_cotes(x[idx:idx + p], f[idx:idx + p])
    return ret
This does 5-point Newton-Cotes on every chunk, except possibly the last, where it does a Newton-Cotes of whatever number of points remains. Unfortunately, this will not give you exactly the same results as INT_TABULATED, because the internal methods are different:
Scipy calculates the weights for points that are not equally spaced using what looks like a least-squares fit; I don't fully understand exactly what is going on, but the code is pure Python -- you can find it in your Scipy installation in the file scipy/integrate/quadrature.py.
INT_TABULATED always performs 5-point Newton-Cotes on equispaced data. If the data are not equispaced, it builds an equispaced grid and uses a cubic spline to interpolate the values at those points. You can check the code here.
For the example in the INT_TABULATED docstring, which is supposed to return 1.6271 using the original code and has an exact solution of 1.6405, the function above returns:
>>> x = np.array([0.0, 0.12, 0.22, 0.32, 0.36, 0.40, 0.44, 0.54, 0.64,
... 0.70, 0.80])
>>> f = np.array([0.200000, 1.30973, 1.30524, 1.74339, 2.07490, 2.45600,
... 2.84299, 3.50730, 3.18194, 2.36302, 0.231964])
>>> idl_tabulate(x, f)
1.641998154242472
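If you want to mimic INT_TABULATED more closely, here is a sketch of the resample-then-integrate recipe described above (this is my reading of the IDL documentation and reuses the idl_tabulate helper; I have not verified the output against IDL itself):
import numpy as np
from scipy.interpolate import CubicSpline

def int_tabulated_like(x, f, p=5):
    x = np.asarray(x, dtype=float)
    f = np.asarray(f, dtype=float)
    n = x.shape[0] - 1                    # number of intervals in the raw data
    if n % (p - 1):                       # pad up to a whole number of 5-point panels
        n += (p - 1) - n % (p - 1)
    xs = np.linspace(x[0], x[-1], n + 1)  # equispaced grid over the same range
    fs = CubicSpline(x, f)(xs)            # cubic-spline resample of the tabulated data
    return idl_tabulate(xs, fs, p=p)      # 5-point Newton-Cotes on the equispaced samples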