IDL's INT_TABULATED - SciPy equivalent? - python

I am working on porting some code from IDL to Python. One IDL call is to INT_TABULATED, which performs integration on a fixed range.
The INT_TABULATED function integrates a tabulated set of data { xi , fi } on the closed interval [MIN(x) , MAX(x)], using a five-point Newton-Cotes integration formula.
Result = INT_TABULATED( X, F [, /DOUBLE] [, /SORT] )
where Result is the area under the curve (IDL docs).
My question is, does NumPy/SciPy offer a similar form of integration? I see that scipy.integrate.newton_cotes exists, but it appears to return "weights and error coefficient for Newton-Cotes integration" rather than the area.

SciPy does not provide such a high-order integrator for tabulated data out of the box. The closest you have available without coding it yourself is scipy.integrate.simps, which uses a 3-point Newton-Cotes method (Simpson's rule).
If you simply want comparable integration precision, you could split your x and f arrays into 5-point chunks and integrate them one at a time, using the weights returned by scipy.integrate.newton_cotes, along these lines:
import numpy as np
import scipy.integrate

def idl_tabulate(x, f, p=5):
    def newton_cotes(x, f):
        if x.shape[0] < 2:
            return 0
        rn = (x.shape[0] - 1) * (x - x[0]) / (x[-1] - x[0])
        weights = scipy.integrate.newton_cotes(rn)[0]
        return (x[-1] - x[0]) / (x.shape[0] - 1) * np.dot(weights, f)
    ret = 0
    for idx in range(0, x.shape[0], p - 1):
        ret += newton_cotes(x[idx:idx + p], f[idx:idx + p])
    return ret
This does 5-point Newton-Cotes on all intervals, except perhaps the last, where it does a Newton-Cotes with however many points remain. Unfortunately, this will not give you the same results as INT_TABULATED, because the internal methods are different:
SciPy calculates the weights for unequally spaced points using what looks like a least-squares fit; I don't fully understand what is going on, but the code is pure Python: you can find it in your SciPy installation in the file scipy\integrate\quadrature.py.
INT_TABULATED always performs 5-point Newton-Cotes on equispaced data. If the data are not equispaced, it builds an equispaced grid and uses a cubic spline to interpolate the values at those points. You can check the code here.
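If matching IDL's number is the goal, you could mimic that procedure instead. A rough sketch of my own (untested against IDL), using scipy.interpolate for the spline step:
import numpy as np
import scipy.integrate
import scipy.interpolate

def int_tabulated(x, f, p=5):
    # Resample onto an equispaced grid via a cubic spline, then apply
    # composite (p-1)-interval Newton-Cotes, as INT_TABULATED does.
    n = x.shape[0]
    segments = -(-(n - 1) // (p - 1)) * (p - 1)  # round up to a multiple of p-1
    xs = np.linspace(x[0], x[-1], segments + 1)
    fs = scipy.interpolate.interp1d(x, f, kind='cubic')(xs)
    weights = scipy.integrate.newton_cotes(p - 1)[0]
    h = xs[1] - xs[0]
    return sum(h * np.dot(weights, fs[i:i + p])
               for i in range(0, segments, p - 1))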
For the example in the INT_TABULATED docstring, which is supposed to return 1.6271 with the original IDL code and has an exact solution of 1.6405, the idl_tabulate function above returns:
>>> x = np.array([0.0, 0.12, 0.22, 0.32, 0.36, 0.40, 0.44, 0.54, 0.64,
... 0.70, 0.80])
>>> f = np.array([0.200000, 1.30973, 1.30524, 1.74339, 2.07490, 2.45600,
... 2.84299, 3.50730, 3.18194, 2.36302, 0.231964])
>>> idl_tabulate(x, f)
1.641998154242472

Related

Python weighted quantile as R wtd.quantile()

I want to convert the R function Hmisc::wtd.quantile() into Python.
I took the two implementations below as a reference, and it seems their logic differs from R's:
# First function
import numpy as np

def weighted_quantile(values, quantiles, sample_weight=None,
                      values_sorted=False, old_style=False):
    """Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `array`
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), \
        'quantiles should be in [0, 1]'
    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]
    weighted_quantiles = np.cumsum(sample_weight) / np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)

weighted_quantile(values=[0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319],
                  quantiles=np.arange(0, 1 + 1 / 5, 1 / 5),
                  sample_weight=[1, 1, 1, 1, 1])
>> array([0.2136325, 0.2136325, 0.4079128, 0.4890342, 0.5083345, 0.6197319])
# Second function
def weighted_percentile(data, weights, perc):
    """
    perc : percentile in [0-1]!
    """
    data = np.array(data)
    weights = np.array(weights)
    ix = np.argsort(data)
    data = data[ix]        # sort data
    weights = weights[ix]  # sort weights
    cdf = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights)  # 'like' a CDF
    return np.interp(perc, cdf, data)

weighted_percentile([0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319],
                    [1, 1, 1, 1, 1], np.arange(0, 1 + 1 / 5, 1 / 5))
>> array([0.2136325 , 0.31077265, 0.4484735 , 0.49868435, 0.5640332 ,
          0.6197319 ])
Both give results different from R's. Any idea?
I am Python-illiterate, but from what I see and after some quick checks I can tell you the following.
Here you use uniform (sampling) weights, so you could also directly use the quantile() function. Not surprisingly, it gives the same results as wtd.quantile() with uniform weights:
x <- c(0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319)
n <- length(x)
x <- sort(x)
quantile(x, probs = seq(0,1,0.2))
# 0% 20% 40% 60% 80% 100%
# 0.2136325 0.3690567 0.4565856 0.4967543 0.5306140 0.6197319
By default, the R quantile() function uses the 'type 7' estimator, which determines the (possibly fractional) index h of the observation to use with h = 1 + q(n-1).
In your case:
1 + seq(0,1,0.2)*(n-1)
# 1.0 1.8 2.6 3.4 4.2 5.0
Of course, since you have 5 values/obs and you want quintiles, the indices are not integers. But you know, for example, that the first quintile (h = 1.8) lies between obs 1 and obs 2. More precisely, it is a linear combination of those two observations, with the weight on the upper observation equal to the fractional part of the index:
0.2*x[1] + 0.8*x[2]
# 0.3690567
You can do the same for all the quintiles, on the basis of these indices:
q <-
  c(min(x),                ## 1.0: actually, the first obs
    0.2*x[1] + 0.8*x[2],   ## 1.8: quintile lies between obs 1 and 2
    0.4*x[2] + 0.6*x[3],   ## 2.6: quintile lies between obs 2 and 3
    0.6*x[3] + 0.4*x[4],   ## 3.4: quintile lies between obs 3 and 4
    0.8*x[4] + 0.2*x[5],   ## 4.2: quintile lies between obs 4 and 5
    max(x)                 ## 5.0: actually, the last obs
  )
q
# 0.2136325 0.3690567 0.4565856 0.4967543 0.5306140 0.6197319
You can see that you get exactly the output of quantile() and wtd.quantile().
If instead of 0.2*x[1] + 0.8*x[2] we consider the following:
0.5*x[1] + 0.5*x[2]
# 0.3107726
We get the output of your second Python function. It appears that your second function uses equal 'weights' (obviously I am not talking about the sampling weights here) when combining the two observations. The discrepancy (at least for the second Python function) seems to come from this. I know these are just insights, but I hope they help.
EDIT: note that the difference between the two is not necessarily an 'issue' with the Python code. There are different quantile estimators (and their weighted versions), and the Python functions could simply rely on a different estimator than Hmisc::wtd.quantile(). I think the latter uses the weighted version of the Harrell-Davis quantile estimator. If you really want to implement that one, you should check the source code of Hmisc::wtd.quantile() and try to translate it 'directly' into Python.
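If the goal is simply to reproduce R's default output in Python for uniform weights, here is a minimal sketch of the type-7 rule described above (the function name is mine, not from any library):
import numpy as np

def quantile_type7(values, quantiles):
    """Unweighted 'type 7' quantiles, the default of R's quantile()."""
    x = np.sort(np.asarray(values, dtype=float))
    q = np.asarray(quantiles, dtype=float)
    n = x.size
    h = (n - 1) * q                    # fractional 0-based index
    lo = np.floor(h).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = h - lo                      # weight on the upper observation
    return (1 - frac) * x[lo] + frac * x[hi]

print(quantile_type7([0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319],
                     np.linspace(0, 1, 6)))
# [0.2136325 0.3690567 0.4565856 0.4967543 0.530614  0.6197319]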

How to fit a sine curve to a small dataset

I have been struggling, for apparently no reason, to fit a sine function to a small dataset that resembles a sinusoid. I've looked at many other questions and tried different libraries, and I can't seem to find any glaring mistake in my code. Also, in many answers people are fitting a function onto data where y = f(x); but I'm retrieving both of my lists independently from stellar spectra.
These are the lists for reference:
time = np.array([2454294.5084288 , 2454298.37039515, 2454298.6022165 ,
2454299.34790096, 2454299.60750029, 2454300.35176022,
2454300.61361622, 2454301.36130122, 2454301.57111912,
2454301.57540159, 2454301.57978822, 2454301.5842906 ,
2454301.58873511, 2454302.38635047, 2454302.59553152,
2454303.41548415, 2454303.56765036, 2454303.61479213,
2454304.38528718, 2454305.54043812, 2454306.36761011,
2454306.58025083, 2454306.60772791, 2454307.36686591,
2454307.49460991, 2454307.58258509, 2454308.3698358 ,
2454308.59468672, 2454309.40004997, 2454309.51208756,
2454310.43078368, 2454310.6091061 , 2454311.40121502,
2454311.5702085 , 2454312.39758274, 2454312.54580053,
2454313.52984047, 2454313.61734047, 2454314.37609003,
2454315.56721061, 2454316.39218499, 2454316.5672538 ,
2454317.49410168, 2454317.6280825 , 2454318.32944441,
2454318.56913047])
velocities = np.array([-2.08468951, -2.26117398, -2.44703149, -2.10149768, -2.09835213,
-2.20540079, -2.4221183 , -2.1394637 , -2.0841663 , -2.2458154 ,
-2.06177386, -2.47993416, -2.13462117, -2.26602791, -2.47359571,
-2.19834895, -2.17976339, -2.37745005, -2.48849617, -2.15875901,
-2.27674409, -2.39054554, -2.34029665, -2.09267843, -2.20338104,
-2.49483926, -2.08860222, -2.26816951, -2.08516229, -2.34925637,
-2.09381667, -2.21849357, -2.43438148, -2.28439031, -2.43506056,
-2.16953358, -2.24405359, -2.10093237, -2.33155007, -2.37739938,
-2.42468714, -2.19635302, -2.368558  , -2.45959665, -2.13392004,
-2.25268181])
These are radial velocities of a star observed at different times. When plotted they look like this:
Plotted Data
This is then the code I'm using to fit a test sine to the data:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

x = time
y = velocities

def sin_fit(x, A, w):
    return A * np.sin(w * x)
popt, pcov = curve_fit(sin_fit,x,y) #try to calculate exoplanet parameters with these data
xfit = np.arange(min(x),max(x),0.1)
fit = sin_fit(xfit,*popt)
mod = plt.figure()
plt.xlabel("Time (G. Days)")
plt.ylabel("Radial Velocity")
plt.scatter(x, y, color="b", label="Data")
plt.plot(x, y, color="b", alpha=0.2)
plt.plot(xfit,fit,color="r",label="Model Fit")
plt.legend()
mod.savefig("Data with sin fit.png")
plt.show()
I thought this was right, and it seems right by looking at other answers, but then this is what I get:
Data with model sine
What am I doing wrong?
Thank you in advance
I guess it's because the sin_fit function is not able to fit the data at all. The sine function by default oscillates around y = 0, while your data oscillate somewhere around y = -2.3.
I tried your code and extended sin_fit with an offset, yielding much better results (although still not perfect):
def sin_fit(x, A, w, offset):
    return A * np.sin(w * x) + offset
With this, the function at least has a chance to fit.
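One more point worth adding: curve_fit starts from all-ones parameters unless you pass p0, and with times of order 2454300 the frequency w is nearly impossible to find from that start. A sketch of one way to help it, using the time and velocities arrays above (the "about three cycles" guess for w0 is my assumption, not something derived from the data):
import numpy as np
from scipy.optimize import curve_fit

def sin_fit(t, A, w, phase, offset):
    return A * np.sin(w * t + phase) + offset

t0 = velocities_t = time - time.min()       # shift times so w*t stays small
offset0 = velocities.mean()                 # data oscillate around this level
A0 = (velocities.max() - velocities.min()) / 2
w0 = 2 * np.pi * 3 / t0.max()               # assume ~3 cycles over the baseline
popt, pcov = curve_fit(sin_fit, t0, velocities, p0=[A0, w0, 0.0, offset0])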

Obtaining Legendre polynomial form once Legendre coefficients are determined

I have obtained the coefficients for the Legendre polynomial that best fits my data. Now I am needing to determine the value of that polynomial at each time-step of my data. I need to do this so that I can subtract the fit from my data. I have looked at the documentation for the Legendre module, and I'm not sure if I just don't understand my options or if there isn't a native tool in place for what I want. If my data-points were evenly spaced, linspace would be a good option, but that's not the case here. Does anyone have a suggestion for what to try?
For those who would like to demand a minimum working example of code, just use a random array, get the coefficients, and tell me from there how you would proceed. The values themselves don't matter. It's the technique that I'm asking about here. Thanks.
To simplify Ahmed's example (run in an ipython --pylab session, so np and plot are available):
In [1]: from numpy.polynomial import Polynomial, Legendre
In [2]: p = Polynomial([0.5, 0.3, 0.1])
In [3]: x = np.random.rand(10) * 10
In [4]: y = p(x)
In [5]: pfit = Legendre.fit(x, y, 2)
In [6]: plot(*pfit.linspace())
Out[6]: [<matplotlib.lines.Line2D at 0x7f815364f310>]
In [7]: plot(x, y, 'o')
Out[7]: [<matplotlib.lines.Line2D at 0x7f81535d8bd0>]
The Legendre functions are scaled and offset, as the data should be confined to the interval [-1, 1] to get any advantage over the usual power basis. If you want the coefficients for plain old Legendre functions
In [8]: pfit.convert()
Out[8]: Legendre([ 0.53333333, 0.3 , 0.06666667], [-1., 1.], [-1., 1.])
But that isn't recommended.
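Since pfit is callable, evaluating the fit at the original, unevenly spaced time-steps (what the question actually needs, so the fit can be subtracted) is one line:
In [9]: residuals = y - pfit(x)  # evaluate the Legendre fit at the data points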
Once you have a function, you can just generate a numpy array for the timepoints:
>>> import numpy as np
>>> timepoints = [1,3,7,15,16,17,19]
>>> myarray = np.array(timepoints)
>>> def mypolynomial(bins, pfinal): #pfinal is just the estimate of the final array (i'll do quadratic)
... a,b,c = pfinal # obviously, for a*x^2 + b*x + c
... return (a*bins**2) + b*bins + c
>>> mypolynomial(myarray, (1,1,0))
array([ 2, 12, 56, 240, 272, 306, 380])
It automatically evaluates it for each timepoint in the numpy array.
Now all you have to do is rewrite mypolynomial to go from a simple quadratic example to a proper one for a Legendre polynomial. Treat the function as if it were evaluating a float to return the value, and when called on the numpy array it will automatically evaluate it for each value.
EDIT:
Let's say I wanted to generalize this to all standard polynomials:
>>> import numpy as np
>>> timepoints = [1,3,7,15,16,17,19]
>>> myarray = np.array(timepoints)
>>> def mypolynomial(bins, pfinal): #pfinal is just the estimate of the final array (i'll do quadratic)
...     hist = np.zeros(len(bins))  # blank accumulator, one value per bin
...     for i in range(len(pfinal)):
...         # negative index to go from exponent 0 up to the highest exponent
...         # (pfinal[-i-1], not pfinal[-i], since negative indexing starts at -1)
...         const = pfinal[-i-1]
...         hist += const * (bins ** i)
...     return hist
>>> mypolynomial(myarray, (1,1,0))
array([  2.,  12.,  56., 240., 272., 306., 380.])
EDIT2: Typo fix
EDIT3:
@Ahmed is perfectly right when he states that Horner's rule is good for numerical stability. The implementation here would be as follows:
>>> def horner(coeffs, x):
...     acc = 0
...     for c in coeffs:
...         acc = acc * x + c
...     return acc
>>> horner((1,1,0), myarray)
array([ 2, 12, 56, 240, 272, 306, 380])
Slightly modified to keep the same argument order as before, from the code here:
http://rosettacode.org/wiki/Horner%27s_rule_for_polynomial_evaluation#Python
When you're using a nice library to fit polynomials, the library will in my experience usually have a function to evaluate them. So I think it is useful to know how you're generating these coefficients.
In the example below, I used two functions in numpy, legfit and legval which made it trivial to both fit and evaluate the Legendre polynomials without any need to invoke Horner's rule or do the bookkeeping yourself. (Though I do use Horner's rule to generate some example data.)
Here's a complete example where I generate some sparse data from a known polynomial, fit a Legendre polynomial to it, evaluate that polynomial on a dense grid, and plot. Note that the fitting and evaluating part takes three lines, thanks to the numpy library doing all the heavy lifting.
import numpy as np

### Setup code
def horner(coeffs, x):
    """Evaluate a polynomial at a point or array"""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

x = np.random.rand(10) * 10
true_coefs = [0.1, 0.3, 0.5]
y = horner(true_coefs, x)

### Fit and evaluate
legendre_coefs = np.polynomial.legendre.legfit(x, y, 2)
new_x = np.linspace(0, 10)
new_y = np.polynomial.legendre.legval(new_x, legendre_coefs)

### Plotting only
try:
    import pylab
    pylab.ion()  # turn on interactive plotting
    pylab.figure()
    pylab.plot(x, y, 'o', new_x, new_y, '-')
    pylab.xlabel('x')
    pylab.ylabel('y')
    pylab.title('Fitting Legendre polynomials and evaluating them')
    pylab.legend(['original sparse data', 'fit'])
except ImportError:
    print("Can't start plots.")

Estimate formants using LPC in Python

I'm new to signal processing (and numpy, scipy, and MATLAB for that matter). I'm trying to estimate vowel formants with LPC in Python by adapting this MATLAB code:
http://www.mathworks.com/help/signal/ug/formant-estimation-with-lpc-coefficients.html
Here is my code so far:
#!/usr/bin/env python
import sys
import numpy
import wave
import math
from scipy.signal import lfilter, hamming
from scikits.talkbox import lpc
"""
Estimate formants using LPC.
"""
def get_formants(file_path):
    # Read from file.
    spf = wave.open(file_path, 'r')  # http://www.linguistics.ucla.edu/people/hayes/103/Charts/VChart/ae.wav

    # Get file as numpy array.
    x = spf.readframes(-1)
    x = numpy.fromstring(x, 'Int16')

    # Get Hamming window.
    N = len(x)
    w = numpy.hamming(N)

    # Apply window and high pass filter.
    x1 = x * w
    x1 = lfilter([1., -0.63], 1, x1)

    # Get LPC.
    A, e, k = lpc(x1, 8)

    # Get roots.
    rts = numpy.roots(A)
    rts = [r for r in rts if numpy.imag(r) >= 0]

    # Get angles.
    angz = numpy.arctan2(numpy.imag(rts), numpy.real(rts))

    # Get frequencies.
    Fs = spf.getframerate()
    frqs = sorted(angz * (Fs / (2 * math.pi)))

    return frqs

print get_formants(sys.argv[1])
Using this file as input, my script returns this list:
[682.18960189917243, 1886.3054773107765, 3518.8326108511073, 6524.8112723782951]
I didn't even get to the last steps where they filter the frequencies by bandwidth because the frequencies in the list aren't right. According to Praat, I should get something like this (this is the formant listing for the middle of the vowel):
Time_s F1_Hz F2_Hz F3_Hz F4_Hz
0.164969 731.914588 1737.980346 2115.510104 3191.775838
What am I doing wrong?
Thanks very much
UPDATE:
I changed this
x1 = lfilter([1., -0.63], 1, x1)
to
x1 = lfilter([1], [1., 0.63], x1)
as per Warren Weckesser's suggestion and am now getting
[631.44354635609318, 1815.8629524985781, 3421.8288991389031, 6667.5030877036006]
I feel like I'm missing something since F3 is very off.
UPDATE 2:
I realized that the order being passed to scikits.talkbox.lpc was off due to a difference in sampling frequency. Changed it to:
Fs = spf.getframerate()
ncoeff = 2 + Fs / 1000
A, e, k = lpc(x1, ncoeff)
Now I'm getting:
[257.86573127888488, 774.59006835496086, 1769.4624576002402, 2386.7093679399809, 3282.387975973973, 4413.0428174593926, 6060.8150432549655, 6503.3090645887842, 7266.5069407315023]
Much closer to Praat's estimation!
The problem had to do with the order being passed to the lpc function. ncoeff = 2 + fs / 1000, where fs is the sampling frequency, is the rule of thumb according to:
http://www.phon.ucl.ac.uk/courses/spsci/matlab/lect10.html
I have not been able to get the results you expect, but I do notice two things which might cause some differences:
Your code uses [1, -0.63] where the MATLAB code from the link you provided has [1 0.63].
Your processing is being applied to the entire x vector at once instead of to smaller segments of it (see where the MATLAB code does this: x = mtlb(I0:Iend); ); a sketch follows below.
Hope that helps.
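A sketch of that second point, reusing the variables from get_formants; the 50 ms window is my assumption, not something taken from the MATLAB example:
# Hypothetical: analyze ~50 ms around the middle of the signal instead of
# the whole file (the Praat listing above is for the middle of the vowel).
Fs = spf.getframerate()
mid = len(x) // 2
half_win = int(0.025 * Fs)  # 25 ms on each side
x = x[mid - half_win:mid + half_win]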
There are at least two problems:
According to the link, the "pre-emphasis filter is a highpass all-pole (AR(1)) filter". The signs of the coefficients given there are correct: [1, 0.63]. If you use [1, -0.63], you get a lowpass filter.
You have the first two arguments to scipy.signal.lfilter reversed.
So, try changing this:
x1 = lfilter([1., -0.63], 1, x1)
to this:
x1 = lfilter([1.], [1., 0.63], x1)
I haven't tried running your code yet, so I don't know if those are the only problems.
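For completeness, the bandwidth-filtering step the question never reached translates roughly as follows; this is a sketch reusing the variables inside get_formants, with the 90 Hz and 400 Hz thresholds taken from the linked MathWorks example:
# Sort frequencies while keeping the matching roots together.
indices = numpy.argsort(angz)
frqs = angz[indices] * (Fs / (2 * math.pi))
# Bandwidths from root magnitudes: bw = -(1/2) * (Fs / (2*pi)) * ln|r|
bw = -0.5 * (Fs / (2 * math.pi)) * numpy.log(numpy.abs(numpy.array(rts)[indices]))
# Keep plausible formants: frequency above 90 Hz and bandwidth below 400 Hz.
formants = [f for f, b in zip(frqs, bw) if f > 90 and b < 400]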

Kurtosis, skewness of a bar graph? - Python

What is an efficient method for determining the skew/kurtosis of a bar graph in Python? Since bar graphs are not binned (unlike histograms), this question might not seem to make sense, but what I am trying to do is determine the symmetry of a graph's height vs. distance (rather than frequency vs. bins). In other words, given heights (y) measured along distance (x), i.e.
y = [6.18, 10.23, 33.15, 55.25, 84.19, 91.09, 106.6, 105.63, 114.26, 134.24, 137.44, 144.61, 143.14, 150.73, 156.44, 155.71, 145.88, 120.77, 99.81, 85.81, 55.81, 49.81, 37.81, 25.81, 5.81]
x = [0.03, 0.08, 0.14, 0.2, 0.25, 0.31, 0.36, 0.42, 0.48, 0.53, 0.59, 0.64, 0.7, 0.76, 0.81, 0.87, 0.92, 0.98, 1.04, 1.09, 1.15, 1.2, 1.26, 1.32, 1.37]
What is the symmetry (skewness) and peakedness (kurtosis) of that height (y) distribution as measured over distance (x)? Are skewness/kurtosis appropriate measures for judging how close such real-valued data are to a normal distribution? Or does scipy/numpy offer something for this type of measurement?
I can get a skew/kurtosis estimate by treating the heights (y) as frequencies of the distances (x), binned as follows:
from itertools import chain
from pylab import hist, xlabel, ylabel
from scipy import stats

freq = list(chain(*[[x_v] * int(round(y_v)) for x_v, y_v in zip(x, y)]))
x.extend([x[-1:][0] + x[0]])  # add one extra bin edge
hist(freq, bins=x)
ylabel("Height Frequency")
xlabel("Distance(km) Bins")
print "Skewness,", "Kurtosis:", stats.describe(freq)[4:]
Skewness, Kurtosis: (-0.019354300509997705, -0.7447085398785758)
In this case the height distribution is symmetric (skew -0.02) around the midpoint distance and platykurtic (kurtosis -0.74, i.e. broad).
Considering that I repeat each x value in proportion to its height y to create the frequencies, the resulting list can sometimes get very large. I was wondering if there is a better way to approach this problem? I suppose that I could always normalize dataset y to a range of perhaps 0-100 without losing too much information about the dataset's skew/kurtosis.
This isn't a Python question, nor is it really a programming question, but the answer is simple nonetheless. Instead of skew and kurtosis, let's first consider the easier quantities based on the lower moments: the mean and standard deviation. To make it concrete, and to fit with your question, let's assume your data look like:
X = 3, 3, 5, 5, 5, 7 = x1, x2, x3 ....
Which would give a "bar graph" that looks like:
{3:2, 5:3, 7:1} = {k1:p1, k2:p2, k3:p3}
The mean, u, is given by
E[X] = (1/N) * (x1 + x2 + x3 + ...) = (1/N) * (3 + 3 + 5 + ...)
Our data, however, has repeated values, so this can be rewritten as
E[X] = (1/N) * (p1*k1 + p2*k2 + ...) = (1/N) * (3*2 + 5*3 + 7*1)
The next term, the standard dev., s, is simply
sqrt(E[(X-u)^2]) = sqrt((1/N)*( (x1-u)^2 + (x2-u)^2 + ...))
But we can apply the same reduction to the E[(X-u)^2] term and write it as
E[(X-u)^2] = (1/N)*( p1*(k1-u)^2 + p2*(k2-u)^2 + ... )
= (1/6)*( 2*(3-u)^2 + 3*(5-u)^2 + 1*(7-u)^2 )
This means we don't need multiple copies of each data item to do the sum, as you indicated in your question.
The skew and kurtosis are quite simple at this point:
skew = E[(x-u)^3] / (E[(x-u)^2])^(3/2)
kurtosis = ( E[(x-u)^4] / (E[(x-u)^2])^2 ) - 3
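Putting those formulas into code: a small sketch (the helper is hypothetical, not from any library) that computes the moments directly from the (x, y) pairs. Note it uses the raw heights as weights, so it will differ slightly from the int(round(y)) binning above:
import numpy as np

def weighted_moments(x, w):
    """Skew and excess kurtosis of values x with nonnegative weights w."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    p = w / w.sum()                  # normalized weights
    u = np.sum(p * x)                # E[X]
    m2 = np.sum(p * (x - u) ** 2)    # E[(X-u)^2]
    m3 = np.sum(p * (x - u) ** 3)
    m4 = np.sum(p * (x - u) ** 4)
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

# Heights y act as weights over distances x:
# print(weighted_moments(x, y))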
